In the *k*-means algorithm, what happens if two of the initially chosen centroids are extremely near to each other?

Say I have two centroids *c1* and *c2*, and d(*c1*, *c2*) ~ 0, i.e. the distance between *c1* and *c2* is very near to 0. In such cases,how do I determine whether I should allot a point *x* to the centroid *c1* or *c2*, seeing as *x* will be equidistant from both *c1* and *c2*?

One approach I found is to calculate the median between *c1* and *c2*, and allot all values lesser than the median to *c1* and all values greater to the centroid to *c2*.

Are there any other solutions?

**Contents**hide

#### Best Answer

Tied distances are rather unusual. So even if the points are close, one will be nearer.

But lets look at what happens if we happened to draw two duplicate points as initial centers.

Then the distance of any point to these two *will* be the same.

Most implementations will assign all points to either the first or the last cluster. So what happens then? That center moves to the global data mean. The other one remains unchanged (usually: some implementations may fail on empty clusters, or decrease k, or draw a new center). At least two points (or initial centers) will be assigned to that mean on the next iteration, and the first center will thus move further away from these initial values. So we don't really have a problem here (well, some bad implementarions may have a division by zero …), it just takes a bit longer to converge.

A slight problem may arise if our data is symmetric. Assume we have data points

` 0 1 2 3 4 10 10 16 17 18 19 20 `

We would intuitively like our centers to be something symmetric like 4 and 16 if we use k=2 (the optimum solution probably is 2 and 15.x, so not symmetric).

But if we draw 10 and 10 as initial cluster centers, we will usually get a result where all points are assigned to the first cluster, mean 10, and no points in the second cluster (mea remains at previous value 10).