I am using the code from "Using BIC to estimate the number of k in KMEANS" (answer by Prabhath Nanisetty) to compute BIC values for K-means with different numbers of clusters. However, on the iris dataset I get the following results:

| N_clusters | BIC |
|---|---|
| 1 | -863.896405 |
| 2 | -674.133038 |
| 3 | -616.557809 |
| 4 | -603.357368 |
| 5 | -582.428798 |
| 6 | -596.073710 |
| 7 | -590.086212 |
| 8 | -579.876476 |
| 9 | -554.665433 |

This is shown in the following plot:

The plot after standardization of data:

Is it normal to have negative values for BIC? Which is the best number of clusters by BIC here, especially considering that the iris data set has 3 groups? The most negative value in the list above is for 1 cluster.


#### Best Answer

I also use the code from the link you provided.

First, it is normal to have negative values of BIC. Because you are using `BIC = likelihood - penalty`, you want to find the highest value, so in your first image we would clearly pick `N_clusters = 8`, and in the second image `N_clusters = 9`.

I get almost the same if I use the squared euclidean distance:

If I use the (non-squared) Euclidean distance instead, I get the expected results. This is the formula I've been using; I've run some tests and it seems correct.

Using the appropriate Euclidean distance gives me this plot:

And here we can obviously see that the appropriate number of clusters to pick is 3 (Setosa, Versicolor and Virginica).
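To make the distance choice concrete, here is a minimal sketch of an X-means-style BIC (log-likelihood minus penalty, so higher is better). It is not copied from the linked answer; the formula, the function name `kmeans_bic`, the helper `centers_from`, the `squared` flag and the toy data are all assumptions made for illustration. The flag only switches between summing squared and plain Euclidean distances in the variance estimate, which is the difference discussed above.

```python
import numpy as np


def kmeans_bic(X, labels, centers, squared=True):
    """X-means-style BIC for a hard clustering (higher is better).

    Assumes every cluster is non-empty, N > number of clusters,
    and the points are not all exactly on their centres.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    N, d = X.shape
    m = centers.shape[0]
    n = np.bincount(labels, minlength=m)  # cluster sizes

    # Euclidean distance of every point to its own cluster centre.
    dist = np.linalg.norm(X - centers[labels], axis=1)
    spread = np.sum(dist ** 2) if squared else np.sum(dist)

    # Pooled spherical variance estimate shared by all clusters.
    var = spread / ((N - m) * d)

    log_lik = np.sum(
        n * np.log(n)
        - n * np.log(N)
        - 0.5 * n * d * np.log(2 * np.pi * var)
        - 0.5 * (n - 1) * d
    )
    penalty = 0.5 * m * (d + 1) * np.log(N)
    return log_lik - penalty


def centers_from(X, labels):
    """Mean of the points in each cluster (labels must be 0..m-1)."""
    return np.array([X[labels == k].mean(axis=0)
                     for k in range(labels.max() + 1)])


# Toy data: three tight, well-separated groups of four points each.
square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
X = np.vstack([square, square + [10.0, 10.0], square + [20.0, 0.0]])

labels_3 = np.repeat([0, 1, 2], 4)       # the three true groups
labels_2 = np.array([0] * 4 + [1] * 8)   # last two groups merged

bic_3 = kmeans_bic(X, labels_3, centers_from(X, labels_3))
bic_2 = kmeans_bic(X, labels_2, centers_from(X, labels_2))
bic_3_plain = kmeans_bic(X, labels_3, centers_from(X, labels_3),
                         squared=False)

print("BIC, 2 clusters:", bic_2)
print("BIC, 3 clusters:", bic_3)
print("BIC, 3 clusters, plain distances:", bic_3_plain)
```

On this toy data both values are negative and the three-cluster partition gets the higher BIC. With a fitted scikit-learn `KMeans` object you would pass `kmeans.labels_` and `kmeans.cluster_centers_` as the last two arguments.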

One last note: it doesn't make sense to set your minimum `n_clusters` to 1; it should start at 2. I only started at 1 to make the plot look like yours.