First of all, I know that this question has been addressed a certain number of times, but I didn't find an answer concerning the clustering of variables, instead of observations.
Concretely, I am using the function
varclus from the package
Hmisc to perform variables clustering.
As an example, I want to perform a cluster analysis on the variables of the dataset
ionosphere (available in the package
My code is as follow :
> library(Hmisc) > library(dprep) > data(ionosphere) > iono_min_l_col <- ionosphere[-length(ionosphere)] > iono_mx_min_l_col <- data.matrix(iono_min_l_col) > iono_clus <- varclus(iono_mx_min_l_col) > plot(iono_clus)
plot() I get the following dendrogram :
However, I don't know in how many clusters I should group my variables.
I can used the following code to know in which cluster are my variables (e.g. V1 Cluster 1, V2 Cluster 3, etc…), but I don't know how to get the optimal number of clusters ? (i.e. k in the following code)
> groups <- cutree(varclus(iono_mx_min_l_col)$hclust, k= ***???***)
Does someone know how to get this optimal number, k ?
Thanks a lot !
Be careful when using correlation, it should be applied on stationary variables, otherwise you will measure spurious correlation. I can give you a detailed answer tomorrow if you don't mind, time to sleep now after a rough day at work… But basically, I suggest to use the idea of bootstrapping, i.e. you perturb your observations by sampling from them with replacement. You create $L$ of such perturbed dataset (somewhat similar copies) and then you apply your clustering for $K=2..N-1$ and evaluate for each $K$ how similar are the results (with an Adjusted Rand Index for example). The rationale is that if there were some real underlying clustering structure of correlation, it should be persistent to small statistical noise. Thus, the better score for the real number of clusters $K$. I use that in practice on stocks and other financial products. It works quite well for detecting the macro level.
For example, you can use something similar to the code below. I also evaluate the stability on the pure noise samples to check for bias in the evaluation. Then, you can plot the stability scores and the noise scores, and have an idea of the number of clusters you may have in your data.
#load data data,names,dates = read_data() #create the bootstrap samples obj_samples = create_bootsamples(data,names) #create the noise samples noise_samples = create_noise_samples(len(data),len(data)) nbClust_min = 2 nbClust_max = round(len(names)/3) nb_clusters = [i for i in range(nbClust_min,nbClust_max+1)] stability_scores = ; noise_scores =  for K in nb_clusters: print("nbCluster: "+str(K)+" / "+str(nbClust_max)) for cur_sample in obj_samples: cur_sample["K"] = K for cur_noise in noise_samples: cur_noise["K"] = K #data processing boot_labels = Parallel(n_jobs=-2,verbose=5) (delayed(compute_clustering)(cur_sample) for cur_sample in obj_samples) stability_scores.append(evaluate_stability(boot_labels)) #noise processing noise_labels = Parallel(n_jobs=-2,verbose=5) (delayed(compute_clustering)(cur_noise) for cur_noise in noise_samples) noise_scores.append(evaluate_stability(noise_labels))
And, I will get: which indicates me that there is a first level with 2 clusters, then 7 clusters, then it is approximately the same whatever the number of clusters. This makes sense knowning the data on which it was computing on 🙂
Nonetheless, be careful clustering may be quite slow to converge and it depends on
- the clustering methodology chosen,
- obviously, the correlation coefficient that you have chosen which converges more or less quickly,
the complexity of your data:
- how many real clusters are in your variables?
- is there a hierarchy of clusters?
- is the contrast between clusters strong?
To understand how it may impact your results, consider this simple experiments. Build a distance/correlation matrix which is 'neat'. Clusters are obvious and thus the matrix is block diagonal. Then, sample from this matrix (using a simple multivariate Gaussian if you want). Then, you can estimate how much data/realizations you may need to recover the underlying clusters. You may be surprised 🙂 Remember that here you have assumed a very-neat situation and you have stationary (even i.i.d.) observations from a Gaussian, so the ideal case that never happens in practice! You can complexify the model with respect to your understanding of the data.
For example, with a Single Linkage clustering, you may get:
If you are interested by that topic, you can have a look at this paper.
- Solved – Clustering variables with outliers
- Solved – Clustering variables with outliers
- Solved – How to identify the variables that contribute most to creating the partition between two clusters
- Solved – Noise and Outliers in DBSCAN
- Solved – Clustering given pairwise distances with unknown cluster number