I am trying to compare the output of a k-means implementation I wrote with the output of R's `kmeans`. Since the objective of the algorithm is to minimize the total within-cluster sum of squares (WCSS), I am comparing the `withinss` output of `kmeans` against my own WCSS. What I can't figure out is how much variation I should allow, i.e. how close these values need to be before I can say that they agree.
(I know that I can also verify using the centroids, but when the data is not that well clustered, the centroids can be very different but the total WCSS can be close, and hence I'm looking at WCSS.)
Best Answer
As k-means finds different local minima on different runs, the values can differ by an arbitrarily large amount. On the contrary, if two values are close but not identical, I'd consider it much more likely that there is a slight error in one of the two implementations.
If there are multiple local minima, multiple runs with different seedings should give you a number of candidates so there is a high chance of actually finding the same result.
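To make the point about restarts concrete, here is a minimal sketch in plain Python (not R, and not the code of either implementation being compared). It assumes a textbook Lloyd iteration with Euclidean distance; the function names `sq_dist`, `wcss`, `lloyd` and `wcss_over_restarts`, and the toy data set, are made up for illustration.

```python
import random

def sq_dist(p, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((x - y) ** 2 for x, y in zip(p, c))

def wcss(points, centers, labels):
    """Total within-cluster sum of squares for a given assignment."""
    return sum(sq_dist(p, centers[j]) for p, j in zip(points, labels))

def lloyd(points, init, max_iter=100):
    """Plain Lloyd iteration starting from the given initial centers."""
    centers = [list(c) for c in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center, ties go to the lower index.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: a local minimum
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

def wcss_over_restarts(points, k, seeds):
    """Run Lloyd once per seed (k random data points as start); return each WCSS."""
    results = []
    for s in seeds:
        init = random.Random(s).sample(points, k)
        centers, labels = lloyd(points, init)
        results.append(wcss(points, centers, labels))
    return results

# Toy data (hypothetical): two tight groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

good = wcss(points, *lloyd(points, [(0, 0), (10, 10)]))  # one start per group
bad = wcss(points, *lloyd(points, [(0, 1), (1, 0)]))     # both starts in one group
print(good, bad)
print(sorted(set(round(v, 6) for v in wcss_over_restarts(points, 2, range(20)))))
```

On this toy data the first start converges to a WCSS of 4.0, while the second gets stuck at about 402.67, so even with clearly separated groups two correct runs can disagree by two orders of magnitude. Collecting the distinct values over many seeds for each implementation and checking whether the sets overlap is one way to compare them; which values show up for a given set of seeds depends on the random starts.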
But in the end, k-means is so simple, and such a crude heuristic, that I have to ask: what good is it to compare two results? On many data sets the result is still pretty much a random partitioning: optimized to a local minimum, but still meaningless.