I am using code from Using BIC to estimate the number of k in KMEANS (answer by Prabhath Nanisetty) to find BIC values for K-means using different number of components. However, using iris dataset, I get following results:
N_clusters BIC 1 -863.896405 2 -674.133038 3 -616.557809 4 -603.357368 5 -582.428798 6 -596.073710 7 -590.086212 8 -579.876476 9 -554.665433
This is shown in following plot:
The plot after standardization of data:
Is is normal to have negative values for BIC. Which is the best number of clusters by BIC here, especially considering that iris data set has 3 groups? Most negative value in above list is for 1 cluster only.
Best Answer
I also use the code from the link you provided.
First thing, it is normal to have negative values of BIC. As you are using BIC = likelihood - penalty
you want to find the highest value, which in your first image clearly we would pick N_clusters = 8
and in the second image N_clusters = 9
.
I get almost the same if I use the squared euclidean distance:
If I use the euclidean distance I get the expected results and this is the formula I've been using because I've made some tests and it seems correct.
The results using the appropriate euclidean distance gives me this plot:
And here we can obviously see that the appropriate number of clusters to pick is 3 (Setosa, Versicolor and Virginica).
One last note is that it doesn't make sense to set your minimum n_clusters
to 1, it should start with 2. I only started with 1 to make the plot look like yours.