Solved – Clustering data based on correlation

I have a dataset where each row represents a sample and each sample is described by its chemical composition. You can see the first 10 rows of the dataset in Figure 1.

Figure 1 – Each row represents a sample, and each sample is broken down into 17 different chemical compounds plus the total (all values are given as percentages)

First, I computed the correlation between the samples and produced the correlation matrix shown in Figure 2.

Figure 2 – Correlation matrix

But what I really want is to cluster the chemical compounds that are more likely to be found together in a sample.

You seem to be looking for cluster analysis. Cluster analysis groups data according to some distance measure, and correlation may well serve as the basis for your distance measure(*). Since you have not mentioned any rule for how strongly items must correlate to end up together in one group, hierarchical cluster analysis might be in order: it shows visually how many groups form depending on where you set the cutoff.
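As a rough illustration of that suggestion, here is a minimal Python sketch that clusters the columns (compounds) of a data matrix hierarchically, using 1 - Pearson correlation as the distance. The function name, the column names A1, A2, B1, B2, the synthetic data, the average linkage and the cutoff of 0.5 are all assumptions made for the example; none of them come from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_columns_by_correlation(data, names, cutoff=0.5, method="average"):
    """Group columns whose correlation distance (1 - r) stays below `cutoff`.

    Hypothetical helper for illustration; `cutoff` and `method` are
    arbitrary choices, not recommendations.
    """
    # Pearson correlation between columns (rowvar=False: columns are variables).
    corr = np.corrcoef(data, rowvar=False)
    # Correlation distance: 0 for r = 1, 2 for r = -1.
    dist = np.clip(1.0 - corr, 0.0, 2.0)
    # Remove floating-point asymmetry and force an exact zero diagonal.
    dist = (dist + dist.T) / 2.0
    np.fill_diagonal(dist, 0.0)
    # linkage() expects the condensed (upper-triangle) form of the matrix.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method=method)
    # Cut the tree at the chosen distance to obtain flat cluster labels.
    labels = fcluster(tree, t=cutoff, criterion="distance")
    return dict(zip(names, labels.tolist())), tree


# Synthetic stand-in data: A1/A2 vary together, B1/B2 vary together.
rng = np.random.default_rng(0)
n_samples = 200
a = rng.normal(size=n_samples)
b = rng.normal(size=n_samples)
names = ["A1", "A2", "B1", "B2"]
data = np.column_stack([
    a + 0.1 * rng.normal(size=n_samples),
    a + 0.1 * rng.normal(size=n_samples),
    b + 0.1 * rng.normal(size=n_samples),
    b + 0.1 * rng.normal(size=n_samples),
])

groups, tree = cluster_columns_by_correlation(data, names, cutoff=0.5)
print(groups)
```

To see the structure before settling on a cutoff, the returned tree can be passed to scipy.cluster.hierarchy.dendrogram(tree, labels=names). Note that with 1 - r, strongly anti-correlated compounds count as far apart; if they should be grouped together instead, 1 - |r| would be the distance to use.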

(*) writes

Correlation-based distance considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance. The distance between two objects is 0 when they are perfectly correlated. Pearson’s correlation is quite sensitive to outliers. […]

If we want to identify clusters of observations with the same overall profiles regardless of their magnitudes, then we should go with correlation-based distance as a dissimilarity measure.
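The point of the quoted passage can be checked with a small Python sketch: two profiles with the same shape but very different magnitudes have a correlation distance of about 0, while their Euclidean distance is large. The vectors and the helper name are made up for the illustration.

```python
import numpy as np


def correlation_distance(x, y):
    """Return 1 - Pearson r between two 1-D profiles (hypothetical helper)."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r


# Two made-up profiles: `high` is a scaled and shifted copy of `low`,
# so they are perfectly correlated but far apart in absolute terms.
low = np.array([1.0, 2.0, 3.0, 4.0])
high = 10.0 * low + 50.0

d_corr = correlation_distance(low, high)  # close to 0
d_eucl = np.linalg.norm(low - high)       # large

print(f"correlation distance: {d_corr:.6f}")
print(f"Euclidean distance:   {d_eucl:.2f}")
```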
