Solved – How to estimate most important dimensions of the clusters after performing k-means

I need to cluster customers of retail shops based on the products they purchased. As results, I need both the customers belonging to each cluster and, for each cluster, the products that most strongly characterize it. For instance, in cluster A, the customers mostly purchase meat, bread, milk, etc. I'm going to use the k-means clustering algorithm with Apache Spark MLlib.

How can I estimate the most important features in each cluster after applying the clustering algorithm?

To estimate point importance in each cluster, one way is to rank the points within a cluster by their distance from the centroid (in the case of k-means), treating the closest points as the most representative, or to sample a data point according to the cluster's mixture distribution (in the case of a GMM).
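The distance-based ranking can be sketched as follows. This is a minimal illustration in NumPy, not Spark MLlib code: the helper name `rank_by_centroid_distance`, the toy purchase matrix, and the hard-coded cluster assignments are all assumptions made for the example. In practice the labels and centroids would come from the fitted k-means model.

```python
import numpy as np

def rank_by_centroid_distance(X, labels, centroids):
    """Map each cluster id to its member row indices, closest to the centroid first."""
    ranking = {}
    for k, centroid in enumerate(centroids):
        members = np.flatnonzero(labels == k)              # rows assigned to cluster k
        dists = np.linalg.norm(X[members] - centroid, axis=1)
        order = np.argsort(dists, kind="stable")           # stable sort keeps ties in row order
        ranking[k] = members[order]
    return ranking

# Toy data (assumed): rows are customers, columns are purchase counts for two products.
X = np.array([[1.0, 2.0], [1.0, 4.0], [1.0, 0.0],
              [4.0, 2.0], [4.0, 4.5], [4.0, 0.0]])
labels = np.array([0, 0, 0, 1, 1, 1])                      # assumed k-means assignments
# A converged k-means centroid is the mean of the points assigned to it.
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

ranking = rank_by_centroid_distance(X, labels, centroids)
for k, rows in ranking.items():
    print(f"cluster {k}: customers ranked by closeness to centroid -> {rows.tolist()}")
```

The first index in each list is the customer nearest the centroid, which can be read as the most typical member of that cluster.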

Alternatively, you can try Affinity Propagation (AP) clustering, which finds exemplars (members of the input set that are representative of clusters). It is available in scikit-learn and does not require specifying the number of clusters in advance.
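A small sketch of this approach with scikit-learn's `AffinityPropagation` is shown below. The toy data and the `random_state` value are assumptions chosen for illustration. Real purchase data would likely need preprocessing and tuning of parameters such as `preference` and `damping`.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy data (assumed): rows are customers, columns are purchase counts for two products.
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# The number of clusters is not passed in; AP derives it from the data.
ap = AffinityPropagation(random_state=5).fit(X)

exemplar_rows = ap.cluster_centers_indices_   # row indices of the exemplars in X
labels = ap.labels_                           # cluster id for every row

for k, row in enumerate(exemplar_rows):
    members = np.flatnonzero(labels == k)
    print(f"cluster {k}: exemplar is row {row}, members {members.tolist()}")
```

Each exemplar is an actual row of the input, so it can be inspected directly as a representative customer of its cluster.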
