I have a data frame with about 500 observations and 8 variables that I'd like to run through PCA to reduce its dimensionality, keeping only the components that capture the most variance.
From there, I want to find the [Euclidean] distance between each pair of observations.
Here's my question: should I use every Principal Component to calculate the distances? Or should I just use (by the general rule of thumb) the Principal Components that describe, in total, about 90% of the variance (here, the first 6)?
Here's the importance of components (from R) if you're curious:
Importance of components:
                          PC1    PC2    PC3    PC4    PC5     PC6     PC7     PC8
Standard deviation     1.4652 1.1997 1.0477 0.9630 0.9103 0.87524 0.75321 0.47645
Proportion of Variance 0.2683 0.1799 0.1372 0.1159 0.1036 0.09576 0.07092 0.02838
Cumulative Proportion  0.2683 0.4482 0.5855 0.7014 0.8050 0.90071 0.97162 1.00000
Any ideas? I'd appreciate any insight.
If you want the exact Euclidean distance, you have to use all of the principal components. The full set of components is just a rotation of the data that went into the PCA, so distances computed on all 8 components equal the distances in that data. The main advantage of PCA is that you lose little in the approximation when you drop the components with low eigenvalues. If you drop them you still get an approximation to the exact Euclidean distance, and the smaller the dropped eigenvalues are, the better that approximation is. In your case PC7 and PC8 together still account for about 10% of the variance, so distances on the first 6 components will be close to the exact ones but not identical.
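To put a number on "quite a good approximation", here is a minimal sketch in Python (an assumption on my part, since your output comes from R) that uses only the standard deviations from your summary. It relies on one identity: for any single component, the squared difference averaged over all pairs of observations is exactly twice that component's sample variance. So the share of the mean squared pairwise distance kept by the first k components equals their cumulative proportion of variance. The name `retained_share` is made up for illustration.

```python
# Standard deviations of PC1..PC8, copied from the summary in the question.
sdev = [1.4652, 1.1997, 1.0477, 0.9630, 0.9103, 0.87524, 0.75321, 0.47645]

# Each component's variance (eigenvalue) is its standard deviation squared.
variances = [s ** 2 for s in sdev]
total = sum(variances)


def retained_share(k):
    """Share of the mean squared pairwise distance kept by the first k PCs.

    Averaged over all pairs, the squared difference on one component is
    2 * variance, so the factor 2 cancels in the ratio.
    """
    return sum(variances[:k]) / total


for k in range(1, len(sdev) + 1):
    print(f"first {k} PCs keep {retained_share(k):.4f} of the mean squared distance")
```

For k = 6 this reproduces the cumulative proportion in your table, about 0.90. That is an average over all pairs, so an individual pair whose difference happens to lie mostly along PC7 or PC8 can be distorted by more than that.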