# Solved – How to profile, visualise and understand large number of groups/classes/clusters in data

I am working on clustering a medium-sized, high-dimensional data set (200k rows; 120 columns).
Once I have attempted (multiple) cluster solutions, I would like to profile my clusters and understand them.

Previously, I calculated descriptive statistics (mean, mode, median, standard deviation). I also tried parallel coordinates plots, but these don't help much with a large number of variables.

I was wondering if there are some other ways for profiling and understanding clusters.

I understand that you're interested in visual approaches to cluster insight.

When you ran your descriptive stats, did you compute an index of each cluster's value relative to the total-sample value for that statistic? For the 120 features in your data, compute the statistic in total and by cluster to get a (k+1)x120 matrix, where k is the number of clusters. Then divide each cluster's value by the grand mean (or median, or whichever statistic you chose) for that feature, multiply by 100 and round to a whole number. The resulting index reads like an IQ score: 100 means the cluster matches the overall sample, an index of 80 or less marks a feature as under-represented in that cluster, and 120 or more marks it as over-represented. It's really simple, but useful for quick-and-dirty insights.

Once you have the indices, you can create a heat map of them that highlights where clusters deviate from the overall sample. Here's a link to a fairly clear introduction to heat mapping:

http://www.fusioncharts.com/dev/chart-guide/heat-map-chart/introduction.html
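As a rough illustration of the index and the heat-map idea, here is a minimal Python sketch using only the standard library. The function names `cluster_index` and `flag`, the toy data, and the choice of the mean as the statistic are assumptions made for illustration; the 80/120 cut-offs are the rule of thumb described above. It also assumes positive-valued features with a non-zero grand mean, since a ratio to a mean near zero is not meaningful.

```python
from statistics import mean


def cluster_index(rows, labels):
    """Return {cluster: [index per feature]}, index = 100 * cluster mean / grand mean."""
    n_features = len(rows[0])
    grand = [mean(r[j] for r in rows) for j in range(n_features)]

    # Group the rows by their cluster label.
    by_cluster = {}
    for row, lab in zip(rows, labels):
        by_cluster.setdefault(lab, []).append(row)

    index = {}
    for lab, members in by_cluster.items():
        index[lab] = [
            # None marks features whose grand mean is zero (index undefined).
            round(100 * mean(m[j] for m in members) / grand[j]) if grand[j] != 0 else None
            for j in range(n_features)
        ]
    return index


def flag(index, low=80, high=120):
    """Crude text stand-in for a heat map: '-' under, '+' over, '.' typical, '?' undefined."""
    return {
        lab: "".join(
            "?" if v is None else "-" if v <= low else "+" if v >= high else "."
            for v in vals
        )
        for lab, vals in index.items()
    }


# Toy data: 4 rows, 3 features, 2 clusters (hypothetical values).
rows = [
    [10.0, 200.0, 5.0],
    [12.0, 204.0, 5.0],
    [30.0, 190.0, 5.0],
    [28.0, 206.0, 5.0],
]
labels = [0, 0, 1, 1]

idx = cluster_index(rows, labels)   # {0: [55, 101, 100], 1: [145, 99, 100]}
flags = flag(idx)                   # {0: '-..', 1: '+..'}
```

With 120 features, the same index matrix can be passed to any plotting library's heat-map routine; the text flags are only meant for a quick scan in a console.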

Joint-space maps would provide a visualization of the clusters relative to the canonical discriminant functions of the features. The canonical variates summarize the features in a low-dimensional component space and also produce average values (centroids) for the clusters. By locating each feature in this new coordinate space, you can build a cluster-by-feature proximity matrix that is easy to visualize. Here's a link to a paper that discusses mapping approaches such as this. The key point is that any dimension reduction method can be used:

http://web.mit.edu/hauser/www/Papers/Alternative_Perceptual_Mapping_Techniques.pdf
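Since any dimension reduction method can be used, the sketch below swaps in PCA (computed by SVD) on standardized cluster centroids in place of canonical discriminant analysis. It is a biplot-style approximation of a joint-space map, not the method from the linked paper. It assumes NumPy is installed and that no column is constant; the function name `joint_space` and the synthetic data are hypothetical.

```python
import numpy as np


def joint_space(X, labels, n_dims=2):
    """Place clusters and features in one low-dimensional space.

    Assumption: PCA of standardized cluster centroids stands in for
    canonical discriminant analysis.
    """
    # Standardize so that features on different scales are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    cluster_ids = np.unique(labels)
    centroids = np.vstack([Z[labels == k].mean(axis=0) for k in cluster_ids])

    # SVD of the centered centroid matrix gives the principal axes.
    centered = centroids - centroids.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)

    cluster_xy = U[:, :n_dims] * s[:n_dims]   # cluster coordinates (scores)
    feature_xy = Vt[:n_dims].T                # feature coordinates (loadings)
    return cluster_ids, cluster_xy, feature_xy


# Synthetic example: 3 clusters, 4 features; clusters differ on features 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
labels = np.repeat([0, 1, 2], 100)
X[labels == 1, 0] += 6.0
X[labels == 2, 1] += 6.0

cluster_ids, cluster_xy, feature_xy = joint_space(X, labels)
```

Plotting `cluster_xy` as points and `feature_xy` as arrows on the same axes gives the joint view: features whose arrows point towards a cluster are the ones on which that cluster scores high.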

Topologists have developed an approach to the analysis and visualization of complex data, described in the paper "Extracting insights from the shape of complex data using topology". Here's the paper (published in Scientific Reports, on nature.com) as well as a link to some R code:

http://www.nature.com/articles/srep01236#f1

http://arxiv.org/abs/1411.1830

Here's a link to an article that has a multitude of visuals for clusters:

http://shabal.in/visuals.html

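Evaluating cluster quality can also provide useful insight. There are lots of approaches to this, but here's a link to a book chapter that describes four external evaluation measures: purity, normalized mutual information (NMI), the Rand index and the F-measure. Only NMI is information-theoretic, and all four compare the clustering against reference class labels, so they apply when you have such labels for at least a sample of the data: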

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html#fig:clustfg3
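To make three of these measures concrete, here is a minimal standard-library sketch of purity, the Rand index and NMI (normalized by the average of the two entropies). The function names and the 17-point toy example are illustrative assumptions; the formulas follow the usual contingency-table definitions, and `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb, log


def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    by_cluster = {}
    for k, c in zip(clusters, classes):
        by_cluster.setdefault(k, Counter())[c] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(clusters)


def rand_index(clusters, classes):
    """Share of point pairs on which the clustering and the classes agree."""
    total = comb(len(clusters), 2)
    tp = sum(comb(v, 2) for v in Counter(zip(clusters, classes)).values())
    fp = sum(comb(v, 2) for v in Counter(clusters).values()) - tp
    fn = sum(comb(v, 2) for v in Counter(classes).values()) - tp
    tn = total - tp - fp - fn
    return (tp + tn) / total


def nmi(clusters, classes):
    """Mutual information divided by the mean of the two entropies."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    pk, pc = Counter(clusters), Counter(classes)
    mi = sum(v / n * log(n * v / (pk[k] * pc[c])) for (k, c), v in joint.items())

    def entropy(counts):
        return -sum(v / n * log(v / n) for v in counts.values())

    denom = (entropy(pk) + entropy(pc)) / 2
    return mi / denom if denom > 0 else 1.0


# Toy example: 17 points, 3 clusters, reference classes x / o / d.
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = list("xxxxxo") + list("xooood") + list("xxddd")

p = purity(clusters, classes)        # 12/17, about 0.71
ri = rand_index(clusters, classes)   # 92/136, about 0.68
m = nmi(clusters, classes)           # about 0.36
```

All three functions work from contingency counts rather than looping over pairs of points, so they stay practical at 200k rows.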

Hope these help.
