I am trying to use R to do k-means clustering and, like most people, I ran into the challenge of deciding when to stop. I have 10,000 items and potentially 10 times that down the road. My goal is to create a series of clusters that either have a minimum size (e.g. 50 items per cluster) OR contain reasonably similar items. In other words, I don't want any of my output clusters to be too small (even if the items are quite different from each other), but I also don't mind if a cluster is large as long as its items are similar enough.
I imagine I can use some kind of divisive hierarchical approach. I can start by building a small number of clusters and examine each cluster to determine if it needs to be split into more clusters. I can keep doing this till all clusters meet my stopping criteria.
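The divisive idea above could be sketched roughly as follows. This is only an illustration, not an established method: `split_cluster`, `min_size`, and `max_spread` are made-up names, "similar enough" is assumed to mean a small root-mean-square distance to the cluster centroid, and each split is assumed to be a 2-means bisection.

```r
# Recursive bisecting k-means: split a cluster in two only while it is both
# "too spread out" and large enough to yield two valid children.
# min_size and max_spread are hypothetical thresholds, not values from the question.
split_cluster <- function(x, idx, min_size = 50, max_spread = 1.0) {
  pts <- x[idx, , drop = FALSE]

  # Spread = root mean squared distance of the points to their centroid
  centroid <- colMeans(pts)
  spread <- sqrt(mean(rowSums(sweep(pts, 2, centroid)^2)))

  # Stop when the items are similar enough, or when the cluster cannot be
  # split into two parts of at least min_size each
  if (spread <= max_spread || length(idx) < 2 * min_size) {
    return(list(idx))
  }

  km <- kmeans(pts, centers = 2, nstart = 5)
  left  <- idx[km$cluster == 1]
  right <- idx[km$cluster == 2]

  # Reject a split that would create an undersized cluster
  if (min(length(left), length(right)) < min_size) {
    return(list(idx))
  }

  c(split_cluster(x, left,  min_size, max_spread),
    split_cluster(x, right, min_size, max_spread))
}

set.seed(42)
# Toy data: three 2-D blobs of 200 points each
x <- rbind(matrix(rnorm(400, mean = 0),  ncol = 2),
           matrix(rnorm(400, mean = 10), ncol = 2),
           matrix(rnorm(400, mean = 20), ncol = 2))

clusters <- split_cluster(x, seq_len(nrow(x)), min_size = 50, max_spread = 2)
sizes <- vapply(clusters, length, integer(1))
print(sizes)
```

Each element of `clusters` is a vector of row indices into `x`. Because a split is only accepted when both halves reach `min_size`, no returned cluster can fall below that size, at the cost of sometimes leaving a dissimilar cluster unsplit.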
Does anyone know of good resources on how other people do this?
Best Answer
There is a whole family of hierarchical clustering methods that should suit your needs: they build a tree in which each higher level represents bigger (more general) clusters. Analyzing this structure and cutting it in a custom way will get you to the solution you describe.
In R you can check out http://cran.r-project.org/web/views/Cluster.html, where you will find several hierarchical clustering implementations.
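As a rough illustration of custom cutting, here is a sketch built on base R's `hclust`. The function `custom_cut` and the thresholds `min_size` and `max_height` are hypothetical, and it assumes that "similar enough" can be read off the merge height of a node. Note that `dist()` needs memory quadratic in the number of items, so this may not scale far beyond the 10,000 items mentioned.

```r
# Custom cut of an hclust tree: descend from the root and keep splitting a
# node only if it is "too tall" (dissimilar) and both children are big enough.
# min_size and max_height are hypothetical thresholds chosen for illustration.
custom_cut <- function(hc, min_size = 50, max_height = 5) {
  n <- length(hc$order)

  # Observations under an entry of hc$merge:
  # negative = a single observation, positive = row of an earlier merge
  members <- function(node) {
    if (node < 0) return(-node)
    c(members(hc$merge[node, 1]), members(hc$merge[node, 2]))
  }

  walk <- function(node) {
    if (node < 0) return(list(-node))
    left  <- members(hc$merge[node, 1])
    right <- members(hc$merge[node, 2])
    # Keep the node as one cluster if it is compact enough, or if splitting
    # it would produce a child smaller than min_size
    if (hc$height[node] <= max_height ||
        min(length(left), length(right)) < min_size) {
      return(list(c(left, right)))
    }
    c(walk(hc$merge[node, 1]), walk(hc$merge[node, 2]))
  }

  walk(n - 1)  # the last merge row is the root of the tree
}

set.seed(1)
# Toy data: three 2-D blobs of 100 points each
x <- rbind(matrix(rnorm(200, mean = 0),  ncol = 2),
           matrix(rnorm(200, mean = 8),  ncol = 2),
           matrix(rnorm(200, mean = 16), ncol = 2))

hc <- hclust(dist(x), method = "ward.D2")
groups <- custom_cut(hc, min_size = 50, max_height = 10)
sizes <- vapply(groups, length, integer(1))
print(sizes)
```

Unlike `cutree`, which cuts the whole tree at a single height or cluster count, this walks each branch separately, so a tight branch can stay whole while a loose one keeps splitting.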
The easiest approaches would be to:
- run any hierarchical clustering, analyze the tree, and select the level of cluster generality that fits your constraints
- cluster with any existing method, then prune the small clusters (remove them iteratively and assign each of their points to the nearest of the remaining clusters).
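The pruning option might look something like the sketch below. It assumes k-means as the "existing method" and Euclidean distance to cluster centroids as the meaning of "nearest"; `prune_small`, `min_size`, and the starting `centers = 20` are illustrative choices only.

```r
# Prune undersized clusters: repeatedly dissolve the smallest cluster below
# min_size and reassign its points to the nearest remaining centroid.
# min_size and the initial number of clusters are hypothetical values.
prune_small <- function(x, cluster, min_size = 50) {
  cluster <- as.character(cluster)  # work with labels, not numeric ids
  repeat {
    sizes <- table(cluster)
    if (length(sizes) <= 1 || min(sizes) >= min_size) break

    smallest <- names(sizes)[which.min(sizes)]
    moving <- which(cluster == smallest)
    keep <- setdiff(names(sizes), smallest)

    # Centroids of the remaining clusters, one row per cluster
    centers <- do.call(rbind, lapply(keep, function(k) {
      colMeans(x[cluster == k, , drop = FALSE])
    }))

    # Move each point of the dissolved cluster to its closest centroid
    for (i in moving) {
      d <- rowSums(sweep(centers, 2, x[i, ])^2)
      cluster[i] <- keep[which.min(d)]
    }
  }
  cluster
}

set.seed(7)
x <- matrix(rnorm(1000), ncol = 2)  # toy data: 500 points in 2-D
km <- kmeans(x, centers = 20, nstart = 5, iter.max = 50)

pruned <- prune_small(x, km$cluster, min_size = 50)
final_sizes <- table(pruned)
print(final_sizes)
```

Centroids are recomputed after each dissolved cluster, so later reassignments see the updated groups. Starting with deliberately too many clusters and pruning down is one way to let the minimum-size constraint, rather than a fixed k, decide the final number of clusters.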