Solved – String clustering and centroid computation

I have a text file document containing a set of words strings that I want to cluster. I want to use the K-means algorithm. As a distance I'll be using the Levenshtein distance.


Usually if we have for example real number vectors we can just compute the mean of all the vectors and it will represent the centroid of the cluster.

What approach can I use to compute the centroid of a cluster containing strings?

I would strongly recommend using a network clustering algorithm rather than K-means. K-means is strongly tied to Euclidean distances (and variance minimisation), and what you have is something completely different. A number of well-known network algorithms are Louvain clustering, RNSC, Affinity Propagation Clustering, and MCL.

Strings are a type of data that I think maps well onto networks, as the pair-wise relation is not really based on measurement of traits but rather on direct comparison. (Another situation where network approaches make sense is in certain types of high-dimensional data). In bioinformatics sequence clustering with string similarity has been very successfully approached using network clustering algorithms. NOTE you need a similarity, not a distance. For more information on K-means and where it is applicable see e.g. Anony Mousse's answer here – Why only the mean value is used in (K-means) clustering method?. Anony Mousse has written about this many times, with great insight.

Similar Posts:

Rate this post

Leave a Comment