I'm having trouble understanding some of the formulas in this paper related to BIC calculation (Dan Pelleg and Andrew Moore, *X-means: Extending K-means with Efficient Estimation of the Number of Clusters*).

First the variance equation:

- R – number of points
- K – number of clusters
- $mu_i$ – centroid associated with ith point.
- $sigma^2 = frac{1}{R-K}sum_{i}(x_i – mu_{(i)})^2 $

The log likelyhood then uses this sigma.

Am I reading this right, they're using 1 covariance matrix for all clusters (see quote below, they are)? This makes no sense. If you have 5 clusters, each one is a Gaussian according to k-means algorithm. So wouldn't it make sense to compute covariance $sigma^2_i$ for each cluster and use that?

My second question is regarding number of parameters to use in the BIC score. The paper mentions

Number of free parameters $p_j$ is simply the sum of K-1 class

probabilities, M*K centroid coordinates, and one variance estimate.

How do you get the K-1 class probabilities? I could do # of points in class i / total number of points. But then it's K-1, which probability is left out of the sum?

P.S. If anyone has a nicer paper on estimating k using similar methods I'd like to read that as well. At this point I'm not too concerned with speed.

Thanks for your help.

**Contents**hide

#### Best Answer

Let the clusters be indexed by $j = 1, ldots, K$ with $K_j gt 0$ points in cluster $j$. Let $mu_j$ (*no* parentheses around the subscript) designate the mean of cluster $j$. Then, because by definition $mu_{(i)}$ is the mean of whichever cluster $x_i$ belongs to, we can group the terms in the summation by cluster:

$$eqalign{ sigma^2 &= frac{1}{R-K}sum_{i}(x_i – mu_{(i)})^2 \ &= frac{1}{R-K}sum_{j=1}^Ksum_{k=1}^{K_j}(x_k – mu_j)^2 \ &= frac{1}{R-K}sum_{j=1}^K K_j frac{1}{K_j}sum_{k=1}^{K_j}(x_k – mu_j)^2 \ &= frac{1}{R-K}sum_{j=1}^K K_j sigma_j^2 },$$

with $sigma_j^2$ being the variance within cluster $j$ (where we *must* use $K_j$ instead of $K_j-1$ in the denominators to handle singleton clusters). I believe this is what you were expecting.

### Similar Posts:

- Solved – what’s the implementation of SciKit-Learn K-Means for empty clusters
- Solved – Probability for selecting centroids – K-means++
- Solved – Assessing quality of clusters
- Solved – Understanding k-means unsupervised learning for features
- Solved – Understanding k-means unsupervised learning for features