I need to cluster the customers of retail shops based on the products they purchased. As a result, I need both the customers belonging to each cluster and, for each cluster, the products that most strongly characterize it. For instance, the customers in cluster A might mostly purchase meat, bread, milk, etc. I'm going to use the k-means clustering algorithm from Apache Spark MLlib.
How can I estimate the most important features in each cluster after applying the clustering algorithm?
Best Answer
To estimate point importance in each cluster, one option is to rank the points within a cluster by their distance from the centroid (in the case of k-means), or to sample a data point according to its mixture distribution (in the case of a GMM).
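As a rough illustration of the distance-based ranking, here is a minimal sketch in plain Python. The purchase counts, the product columns, and the helper name `rank_by_centroid_distance` are hypothetical; in practice the centroid would come from the fitted k-means model rather than being recomputed as a mean.

```python
import math

def rank_by_centroid_distance(points, centroid):
    """Return (index, distance) pairs sorted from closest to farthest."""
    distances = [(i, math.dist(p, centroid)) for i, p in enumerate(points)]
    return sorted(distances, key=lambda pair: pair[1])

# Hypothetical members of one cluster: each row is a customer,
# each column a purchase count for (meat, bread, milk).
cluster_points = [
    [5.0, 4.0, 3.0],
    [1.0, 0.0, 0.0],
    [4.0, 3.0, 3.0],
]

# For k-means the centroid is the per-feature mean of the cluster members.
centroid = [sum(col) / len(cluster_points) for col in zip(*cluster_points)]

ranking = rank_by_centroid_distance(cluster_points, centroid)
for index, distance in ranking:
    print(f"customer {index}: distance {distance:.3f}")
```

Customers at the top of the ranking are the most typical members of the cluster, so their purchases give a first impression of what the cluster represents.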
Alternatively, you can try Affinity Propagation (AP) clustering, which finds exemplars (members of the input set that are representative of clusters). It's available in scikit-learn and doesn't require specifying the number of clusters in advance.
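A small sketch of the AP approach with scikit-learn's `AffinityPropagation` might look like the following. The customer-by-product matrix is made up for illustration, and the `damping`, `max_iter`, and `random_state` values are assumptions chosen for stability on a toy example, not tuned recommendations.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical purchase counts: rows are customers,
# columns are (meat, bread, milk, vegetables).
X = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [6, 5, 0, 1],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
    [0, 0, 6, 5],
], dtype=float)

# The number of clusters is not passed in; AP infers it from the data
# and from the `preference` parameter (left at its default here).
ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0)
labels = ap.fit_predict(X)

# Indices of the customers chosen as exemplars, one per cluster.
exemplar_idx = ap.cluster_centers_indices_

for cluster_id, idx in enumerate(exemplar_idx):
    members = np.where(labels == cluster_id)[0]
    print(f"cluster {cluster_id}: exemplar customer {idx}, members {members.tolist()}")
```

Each exemplar is an actual customer from the input, so its purchase vector can be read directly as a representative basket for that cluster.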