Solved – PCA and variable contributions to first n dimensions

I am looking at this tutorial:
Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization

Especially the contributions of the variables to the first 2 dimensions:

library(factoextra)  # fviz_contrib() comes from factoextra
# res.pca: a PCA fit, e.g. the FactoMineR::PCA object built in the tutorial

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

Can this be used for feature selection, i.e. keep all variables whose contribution is above the red line?

[Figure: fviz_contrib bar plots of variable contributions, with the red dashed reference line]

I would not think so if the first 2 dimensions do not describe much of the variance in the data. Is this correct? However, if the first 2 dimensions describe more than 80% of the variance, maybe it would be reasonable?
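To make the idea concrete, here is a rough sketch of what "keeping variables above the red line" might look like, assuming res.pca is the FactoMineR::PCA object from the code above; the red dashed line drawn by fviz_contrib() marks the expected average contribution (100/p % for p variables) if all variables contributed equally.

# Hypothetical filter (illustration only): keep variables whose contribution
# to a given axis exceeds the expected average contribution (the red line).
contrib  <- res.pca$var$contrib      # contributions in %, one column per PC
expected <- 100 / nrow(contrib)      # uniform-contribution baseline, in %

above_pc1 <- rownames(contrib)[contrib[, 1] > expected]
above_pc2 <- rownames(contrib)[contrib[, 2] > expected]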

Thanks!

We have a dedicated thread for that very specific purpose: Using principal component analysis (PCA) for feature selection.

Just a few points regarding the interpretation of those visual displays, and some reflections on the question at hand:

  • This graphical output is a visual aid to see which variables contribute the most to the definition of each principal component. If you have a "PCA" object constructed using FactoMineR::PCA, the variable contribution values are stored in the $var$contrib slot of your object (see the short sketch after this list). The contribution is a scaled version of the squared correlation between a variable and a component axis (or the squared cosine, from a geometrical point of view), which is used to assess the quality of the representation of the variables on the principal component; it is computed as $\cos(\text{variable}, \text{axis})^2 \times 100$ / total $\cos^2$ of the component.

  • It might not always be relevant to select a subset of variables based on their contribution to each principal component. Sometimes a single variable can drive a component (this is sometimes known as a size effect, and it might simply result from that variable capturing most of the variance along the first principal axis, which would yield a very high loading for it and very low loadings for the remaining ones); other times the signal is driven by a few variables in higher dimensions (e.g., past the 10th component); finally, a variable might have a high weight on one component, yet also a weight that is above your threshold (10%) on another component: does that mean it is more "important" than the variables that only load on (or drive) a single component?

  • This approach will have a hard time coping with highly correlated variables, yet one principled approach to feature selection is to get rid of collinearity (sometimes simply as a side effect of the algorithm itself) by selecting only one variable from each cluster of highly correlated variables.

  • Beware that any arbitrary cutoff (10% for variable contribution, or 80% for the total explained variance) should be motivated by pragmatic or computational arguments.
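As a small sketch of the first point above, assuming a FactoMineR::PCA fit like the res.pca object in the question, the plotted contributions can be read directly from the object and recovered from the squared cosines:

contrib <- res.pca$var$contrib   # contributions in %, one column per component
cos2    <- res.pca$var$cos2      # squared cosines (quality of representation)

head(contrib)
# Each column of contrib is the corresponding cos2 column rescaled to sum to 100:
all.equal(contrib[, 1], cos2[, 1] * 100 / sum(cos2[, 1]))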

To sum up, this approach to selecting variables might work when used in a single-pass algorithm or as a recursive procedure, but it really depends on the dataset. If the objective is to perform feature selection on a multivariate dataset with a primary outcome, why not use techniques dedicated to this task (the Lasso, Random Forests, Gradient Boosting Machines, and the like), since they generally rely on an objective loss function and provide a more interpretable measure of variable importance? A rough sketch of two such alternatives follows.
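A minimal sketch of those supervised alternatives, assuming a hypothetical data frame dat with a numeric outcome y (dat and y are placeholders; glmnet and randomForest are used purely for illustration):

library(glmnet)
library(randomForest)

x <- as.matrix(dat[, setdiff(names(dat), "y")])  # predictors (placeholder data)
y <- dat$y                                       # numeric outcome (placeholder)

# Lasso: keep variables with a non-zero coefficient at the cross-validated lambda
cv   <- cv.glmnet(x, y, alpha = 1)
beta <- coef(cv, s = "lambda.min")
sel  <- setdiff(rownames(beta)[as.vector(beta) != 0], "(Intercept)")
sel

# Random Forest: rank variables by permutation importance
rf <- randomForest(x, y, importance = TRUE)
importance(rf, type = 1)   # %IncMSE for a numeric outcome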
