I have a text file document containing a set of words strings
that I want to cluster. I want to use the K-means
algorithm. As a distance I'll be using the Levenshtein distance
.
Question
Usually if we have for example real number vectors we can just compute the mean of all the vectors and it will represent the centroid of the cluster.
What approach can I use to compute the centroid of a cluster containing strings?
Best Answer
I would strongly recommend using a network clustering algorithm rather than K-means. K-means is strongly tied to Euclidean distances (and variance minimisation), and what you have is something completely different. A number of well-known network algorithms are Louvain clustering, RNSC, Affinity Propagation Clustering, and MCL.
Strings are a type of data that I think maps well onto networks, as the pair-wise relation is not really based on measurement of traits but rather on direct comparison. (Another situation where network approaches make sense is in certain types of high-dimensional data). In bioinformatics sequence clustering with string similarity has been very successfully approached using network clustering algorithms. NOTE you need a similarity, not a distance. For more information on K-means and where it is applicable see e.g. Anony Mousse's answer here – Why only the mean value is used in (K-means) clustering method?. Anony Mousse has written about this many times, with great insight.
Similar Posts:
- Solved – how to calculate the distance between cluster center and datapoint in K-means
- Solved – Assigning new observations into existing clusters made from a distance matrix
- Solved – Hybrid (K-means + Hierarchical ) clustering
- Solved – Calculating distances in SAS PROC CLUSTER method=centroid
- Solved – How do we calculate distance and centroid for vectors that contain negative values