In the k-means algorithm, what happens if two of the initially chosen centroids are extremely near to each other?
Say I have two centroids c1 and c2, and d(c1, c2) ~ 0, i.e. the distance between c1 and c2 is very near to 0. In such cases,how do I determine whether I should allot a point x to the centroid c1 or c2, seeing as x will be equidistant from both c1 and c2?
One approach I found is to calculate the median between c1 and c2, and allot all values lesser than the median to c1 and all values greater to the centroid to c2.
Are there any other solutions?
Tied distances are rather unusual. So even if the points are close, one will be nearer.
But lets look at what happens if we happened to draw two duplicate points as initial centers.
Then the distance of any point to these two will be the same.
Most implementations will assign all points to either the first or the last cluster. So what happens then? That center moves to the global data mean. The other one remains unchanged (usually: some implementations may fail on empty clusters, or decrease k, or draw a new center). At least two points (or initial centers) will be assigned to that mean on the next iteration, and the first center will thus move further away from these initial values. So we don't really have a problem here (well, some bad implementarions may have a division by zero …), it just takes a bit longer to converge.
A slight problem may arise if our data is symmetric. Assume we have data points
0 1 2 3 4 10 10 16 17 18 19 20
We would intuitively like our centers to be something symmetric like 4 and 16 if we use k=2 (the optimum solution probably is 2 and 15.x, so not symmetric).
But if we draw 10 and 10 as initial cluster centers, we will usually get a result where all points are assigned to the first cluster, mean 10, and no points in the second cluster (mea remains at previous value 10).