Solved – When calculating the distance between points, does it make sense to use all Principal Components?

I have a data frame with about 500 observations and 8 variables that I'd like to run through PCA in order to try and reduce the number of variables to only those with the most variance.

From here, I want to find the [Euclidean] distance between each observation.

Here's my question: should I use every Principal Component to calculate the distances? Or should I just use (by the general rule of thumb) the Principal Components that describe, in total, about 90% of the variance (here, the first 6)?

Here's the importance of components (from R) if you're curious:

Importance of components:
                          PC1    PC2    PC3    PC4    PC5     PC6     PC7     PC8
Standard deviation     1.4652 1.1997 1.0477 0.9630 0.9103 0.87524 0.75321 0.47645
Proportion of Variance 0.2683 0.1799 0.1372 0.1159 0.1036 0.09576 0.07092 0.02838
Cumulative Proportion  0.2683 0.4482 0.5855 0.7014 0.8050 0.90071 0.97162 1.00000

Any ideas? I'd appreciate any insight.

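If you want the exact Euclidean distance, you have to use all of the principal components. PCA only rotates the data that went into it (centered, and scaled if you asked for that) onto new orthogonal axes, so distances computed from all 8 components are identical to distances computed from that data. The advantage of PCA is that you lose little by dropping the components with low eigenvalues: the distance computed from the remaining components is still an approximation to the exact Euclidean distance, never larger than it, and the smaller the dropped eigenvalues, the better the approximation. Here, keeping the first 6 components discards PC7 and PC8, which together account for about 10% of the total variance, so it comes down to whether that loss is acceptable for your purposes.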
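To make the point concrete, here is a minimal sketch in Python using only the standard library. It is a toy 2-D illustration, not the asker's 8-variable data: the `points` list and the helper `pca_2d` are made up for this example, and the PCA is done on the unscaled covariance using the closed-form rotation angle that exists in two dimensions. It compares the distance between two observations in the original coordinates, in all PC coordinates, and in PC1 only.

```python
import math

def pca_2d(points):
    """Return PC scores for 2-D points: center, then rotate onto the principal axes."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Sample covariance matrix entries (n - 1 denominator).
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the direction of maximum variance for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)

    # PC1 score is the projection onto (c, s); PC2 onto the orthogonal (-s, c).
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]

# Hypothetical toy data, chosen only for illustration.
points = [(2.0, 1.9), (0.5, 0.7), (3.1, 2.9), (1.2, 1.0), (4.0, 4.3), (2.6, 2.2)]
scores = pca_2d(points)

i, j = 0, 4
d_orig = math.dist(points[i], points[j])      # distance in the original variables
d_all = math.dist(scores[i], scores[j])       # distance using both PCs
d_pc1 = abs(scores[i][0] - scores[j][0])      # distance using PC1 only

print("original:", d_orig)
print("all PCs: ", d_all)
print("PC1 only:", d_pc1)
```

The first two printed values should agree up to floating-point rounding, because the rotation preserves distances. The PC1-only value can only be equal or smaller, since dropping a component removes a non-negative term from the sum of squares.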
