If I construct a 2-D matrix composed entirely of random data, I would expect the PCA and SVD components to essentially explain nothing.
Instead, the first SVD column appears to explain 75% of the variance. How can this possibly be? What am I doing wrong?
Here is the plot:
Here is the R code:
set.seed(1)
rm(list=ls())

# 100 x 100 matrix of uniform random values in [0, 25]
m <- matrix(runif(10000, min=0, max=25), nrow=100, ncol=100)

# The LINPACK argument of svd() has been defunct since R 3.1.0, so it is omitted
svd1 <- svd(m)

par(mfrow=c(1,4))
image(t(m)[,nrow(m):1])
plot(svd1$d, cex.lab=2, xlab="SVD Column", ylab="Singular Value", pch=19)

percentVarianceExplained <- svd1$d^2/sum(svd1$d^2) * 100
plot(percentVarianceExplained, ylim=c(0,100), cex.lab=2, xlab="SVD Column", ylab="Percent of variance explained", pch=19)

cumulativeVarianceExplained <- cumsum(svd1$d^2/sum(svd1$d^2)) * 100
plot(cumulativeVarianceExplained, ylim=c(0,100), cex.lab=2, xlab="SVD column", ylab="Cumulative percent of variance explained", pch=19)
Update
Thank you, @Aaron. The fix, as you noted, was to center the matrix so that each column has a mean of 0. Note that scale=FALSE means the columns are only centered, not rescaled.
m <- scale(m, scale=FALSE)
Here is the corrected image, showing that for a matrix of random data the variance explained by the first SVD column is close to 0, as expected.
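As a side check on why PCA and a raw SVD can disagree here, the sketch below (an illustration, not part of the original post) compares prcomp(), which centers columns by default, with an SVD of the explicitly centered matrix. The variable names pca_var and svd_var are made up for this example.

```r
set.seed(1)
m <- matrix(runif(10000, min = 0, max = 25), nrow = 100, ncol = 100)

# prcomp() centers the columns by default (center = TRUE).
pca <- prcomp(m)

# SVD of the explicitly centered matrix.
d <- svd(scale(m, scale = FALSE))$d

# Component variances from prcomp() versus d^2 / (n - 1) from the SVD.
pca_var <- pca$sdev^2
svd_var <- d^2 / (nrow(m) - 1)
print(all.equal(pca_var, svd_var))
```

If the two agree, the surprising 75% comes only from running svd() on uncentered data, not from any difference between PCA and SVD.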
Best Answer
The first PC is capturing the fact that the variables are not centered around zero. Centering the columns first, or generating random variables that are already centered around zero, will give the result you expect. For example, either of these:
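# Option 1: generate the data, then center each column
m <- matrix(runif(10000, min=0, max=25), nrow=100, ncol=100)
m <- scale(m, scale=FALSE)

# Option 2: draw from a range that is symmetric around zero
m <- matrix(runif(10000, min=-25, max=25), nrow=100, ncol=100)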
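A rough way to see where the 75% comes from (a back-of-the-envelope sketch, not part of the original answer): entries drawn from U(0, 25) have mean 12.5 and variance 25^2/12, so the constant offset accounts for about mean^2 / (mean^2 + variance) = 3/4 of the total sum of squares, and the first singular vector picks up roughly that offset. The helper first_share below is a hypothetical name used only for illustration.

```r
set.seed(1)
n <- 100
m <- matrix(runif(n * n, min = 0, max = 25), nrow = n, ncol = n)

# Hypothetical helper: share of the total sum of squares carried by
# the first singular value.
first_share <- function(x) {
  d <- svd(x)$d
  d[1]^2 / sum(d^2)
}

# Rough prediction for the uncentered matrix: mean^2 / E[x^2].
mu <- 25 / 2          # mean of U(0, 25)
sigma2 <- 25^2 / 12   # variance of U(0, 25)
predicted <- mu^2 / (mu^2 + sigma2)

raw_share <- first_share(m)
centered_share <- first_share(scale(m, scale = FALSE))

print(c(predicted = predicted, raw = raw_share, centered = centered_share))
```

The raw share should land near the prediction, while the centered share should be far smaller, since centering removes the offset.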