I'm running a PCA over a data set of size $N \times p$ ($N \approx 1000$ being the number of measurements and $p \approx 200$ being the number of dimensions/predictors).
I expect many of the predictors to be correlated and that the dimensions can consequently be reduced. I can even drop some columns that are linearly dependent with respect to the others.
When I run the PCA I find that $\sim 50\%$ of the variance can be explained by the first 5 PCs, suggesting that the predictors can actually be grouped.
But I am concerned about how small the determinant of the correlation matrix $R$ is: I get $\det(R) \approx 10^{-100}$, or some similarly ridiculous number.
Do the results make sense with such a small number?
Moreover, I see that the PCA results change (a lot!) if I round the input numbers to drop insignificant digits, say at the 10th digit or so. I think this is linked to the fact that we are working with such a small determinant.
Since a small determinant of $R$ indicates that there are redundant dimensions, I would say that PCA is the way to go to reduce them. Nevertheless, does it make sense to run a PCA with such a small determinant? If not, what is the best way to reduce the dimensionality of the problem?
Best Answer
Having a very small $\det(R)$ only means that you have some variables that are almost linearly dependent. Note that $\det(R)$ equals the product of the eigenvalues of $R$; so, for the determinant to be that small, at least one eigenvalue must be approximately zero.
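For illustration, here is a quick NumPy sketch with simulated near-collinear data (the sizes, noise level, and construction are just made up for the example, not your actual setup). It shows both points: the determinant is the product of the eigenvalues, and a stack of near-zero eigenvalues is enough to push it to absurdly small values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N measurements of p predictors that are noisy
# combinations of only k underlying factors, so columns are nearly collinear.
N, p, k = 1000, 200, 5
factors = rng.normal(size=(N, k))
X = factors @ rng.normal(size=(k, p)) + 1e-3 * rng.normal(size=(N, p))

R = np.corrcoef(X, rowvar=False)        # p x p correlation matrix
eigvals = np.linalg.eigvalsh(R)         # eigenvalues of R, in ascending order

# det(R) is the product of the eigenvalues; with ~195 near-zero eigenvalues
# the product underflows, so compare on the log scale instead.
sign, logdet = np.linalg.slogdet(R)
print("log det(R):            ", logdet)
print("sum of log eigenvalues:", np.log(eigvals).sum())
print("smallest eigenvalue:   ", eigvals[0])
```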
This only means that you have some extra/redundant dimensions in your dataset, and that PCA will be able to represent essentially $100\%$ of the information with a smaller set of dimensions ($p_\text{new} \le p - 1$).
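If it helps, here is a second self-contained sketch (again with made-up data, this time using scikit-learn, which may or may not match your workflow) showing that very few components already carry essentially all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Same kind of hypothetical setup: p predictors driven by k underlying factors.
N, p, k = 1000, 200, 5
X = rng.normal(size=(N, k)) @ rng.normal(size=(k, p)) + 1e-3 * rng.normal(size=(N, p))

# Standardize the columns so that PCA works on the correlation matrix R.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA().fit(Xs)
cum = np.cumsum(pca.explained_variance_ratio_)

# How many components are needed to capture 99.9% of the variance,
# and how much the first 5 PCs explain on their own.
print("components for 99.9% variance:    ", int(np.searchsorted(cum, 0.999)) + 1)
print("variance explained by first 5 PCs:", cum[4])
```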