Solved – Is the matrix dimension important for performing a valid PCA?

If $X$ is an $m \times n$ matrix, where $m$ is the number of measurement types (variables) and $n$ is the number of samples, would it be correct to perform a PCA on a matrix that has $m \geq n$? If not, please provide some arguments for why this would be a problem.

I remember having heard that doing such an analysis would be invalid, but the Wikipedia page for PCA doesn't mention a low $n/m$ ratio as being a potential limitation for using the method.

Please note that I am a biologist and aim at a more practical answer (if possible).

This is PCA of variables in the case where the number of observations $n$ is low relative to the number of variables $m$. There are two aspects to consider.

1) Mathematical aspect. Whenever $n \leq m$, the correlation matrix is singular: at most $n - 1$ principal components have non-zero variance, and the remaining ones among the $m$ have zero variance, that is, they do not exist. Generally speaking, this is not a problem for PCA, since you can simply ignore those components. However, many software packages (mostly those that combine PCA and factor analysis in one command or procedure) will not accept a singular correlation matrix.

2) Statistical aspect. For your results to be reliable, the correlations themselves must be reliable, and that requires a considerable sample size, one that should always be larger than the number of variables. A common rule of thumb says that with $m = 20$ you ought to have $n = 100$ or so, while with $m = 100$ you should have $n = 300$ or so. As $m$ grows, the minimal recommended $n/m$ ratio diminishes.
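The mathematical point can be checked numerically. What follows is only a minimal sketch, assuming NumPy is available and using random Gaussian data as a stand-in for real measurements. The helper names `correlation_spectrum` and `count_nonzero_components` are hypothetical, and the tolerance `1e-8` is an arbitrary choice. Note that the data here are stored with samples in rows, which is the transpose of the $m \times n$ layout used in the question.

```python
import numpy as np


def correlation_spectrum(n_samples, n_variables, seed=0):
    """Eigenvalues (descending) of the correlation matrix of random data.

    Assumption: the data are i.i.d. standard normal, used purely for
    illustration; real measurements would be loaded here instead.
    """
    rng = np.random.default_rng(seed)
    # Rows are samples, columns are variables.
    data = rng.standard_normal((n_samples, n_variables))
    # rowvar=False tells NumPy that each column is a variable.
    corr = np.corrcoef(data, rowvar=False)
    # eigvalsh is meant for symmetric matrices and returns ascending order.
    return np.linalg.eigvalsh(corr)[::-1]


def count_nonzero_components(eigvals, tol=1e-8):
    """Number of principal components with variance above a small tolerance."""
    return int(np.sum(eigvals > tol))


# Fewer samples than variables: the correlation matrix is singular.
eig_small = correlation_spectrum(n_samples=10, n_variables=20)
# More samples than variables: the correlation matrix has full rank.
eig_large = correlation_spectrum(n_samples=100, n_variables=20)

print(count_nonzero_components(eig_small))  # at most n_samples - 1 = 9
print(count_nonzero_components(eig_large))  # all 20 components
```

With 10 samples and 20 variables, no more than 9 components carry any variance, so the remaining ones can be dropped without losing information. This says nothing about how trustworthy the retained components are, which is the separate statistical concern in point 2.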
