We all know that Principal Component Analysis is executed on a covariance/correlation matrix, but what if we have very high-dimensional data, say 75 features and 157,849 rows?
How does PCA tackle this?
- Does it tackle this problem in the same way as it does for correlated datasets?
- Will my explained variance be equally distributed among the 75 features?
- I came across Bartlett's test and the KMO test, which help us identify whether any correlation is present at all, and the proportion of variance that might be common variance among the variables, respectively.

I can certainly leverage these two tests to make an informed decision, but I am still looking for an answer to:
- How does PCA behave when there is no correlation in the dataset?
I want an interpretation of this in a way that I could explain to my non-technical brother.
Practical example using Python:
import numpy as np
import pandas as pd

s = pd.Series(data=[1, 1, 1], index=['a', 'b', 'c'])
diag_data = np.diag(s)
df = pd.DataFrame(diag_data, index=s.index, columns=s.index)

# Normalizing (column-wise standardization)
df = df.subtract(df.mean()).divide(df.std())
Which looks like:
          a         b         c
a  1.154701 -0.577350 -0.577350
b -0.577350  1.154701 -0.577350
c -0.577350 -0.577350  1.154701
The correlation matrix (which equals the covariance matrix here, since the data is standardized) looks like this:
Cor = np.corrcoef(df.T)
Cor
array([[ 1. , -0.5, -0.5],
       [-0.5,  1. , -0.5],
       [-0.5, -0.5,  1. ]])
Now, calculating PCA Projections:
# Eigendecomposition of the correlation matrix; note that
# np.linalg.eig does not sort the eigenvalues
eigen_vals, eigen_vects = np.linalg.eig(Cor)
projections = pd.DataFrame(np.dot(df, eigen_vects))
And the projections are:
          0             1         2
0  1.414214 -2.012134e-17 -0.102484
1 -0.707107 -2.421659e-16 -1.170283
2 -0.707107 -1.989771e-16  1.272767
The explained variance ratio seems to be equally split between two components (the middle value is numerically zero: after standardization the three points lie in a two-dimensional subspace, so one eigenvalue is exactly zero):
[0.5000000000000001, -9.680089716721685e-17, 0.5000000000000001]
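The code that produced this ratio isn't shown above; a minimal sketch of how it can be derived from the eigenvalues already computed (each eigenvalue divided by their sum, which is the trace of the correlation matrix):

# Explained variance ratio: each eigenvalue over the total variance
explained_ratio = eigen_vals / eigen_vals.sum()
print(list(explained_ratio))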
Now, when I calculated the Q residual to measure the reconstruction error, I got zero for every feature:
a    0.0
b    0.0
c    0.0
dtype: float64
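The Q-residual computation isn't shown either; a minimal sketch of one common way to obtain it, assuming the data is reconstructed from the scores and loadings (np.linalg.eigh would be the more robust choice for a symmetric matrix, but the eigen_vects from above work here):

# Reconstruct the data from scores and loadings; with all three
# components retained the reconstruction is exact up to rounding
X_hat = np.dot(projections, eigen_vects.T)
q_residual = ((df - X_hat) ** 2).sum(axis=0)  # squared residual per feature
print(q_residual.round(10))

Since all components are kept, the projection is just an orthogonal rotation, which is why the residual vanishes.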
This would indicate that PCA on an uncorrelated dataset, such as the identity matrix, gives projections that are very close to the original data points. The same results are obtained with a diagonal matrix.
If the reconstruction error is very low, this would suggest that we can fix the PCA step in a single pipeline: even if the dataset does not carry much correlation, we get essentially the same data back after the PCA transformation, while for a dataset with highly correlated features we can still fight the curse of dimensionality.
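For the single-pipeline idea, a minimal sketch using scikit-learn (StandardScaler, PCA, and Pipeline are standard scikit-learn classes; the 0.95 variance threshold and the name X are illustrative assumptions):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# PCA stays fixed in the pipeline: n_components=0.95 keeps however many
# components are needed to explain 95% of the variance, so uncorrelated
# data passes through almost unchanged while correlated data is compressed
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
])
X_reduced = pipe.fit_transform(X)  # X: e.g. the (157849, 75) feature matrix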
Public views on this?
Best Answer
If you have no observed correlation, then your covariance matrix is diagonal, and the PCA diagonalizes a matrix that is already diagonal (so it does nothing).
If you have no population correlation but observe small sample correlations due to sampling variability, then PCA is diagonalizing a covariance matrix that is nearly diagonal, and the resulting features will differ only minimally from the original ones.
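A quick numerical illustration of the first case (the diagonal entries here are arbitrary toy values):

import numpy as np

C = np.diag([4.0, 2.0, 1.0])    # an already-diagonal covariance matrix
vals, vecs = np.linalg.eigh(C)  # eigh sorts eigenvalues ascending

print(vals)  # [1. 2. 4.] -- just the diagonal entries back
print(vecs)  # a permutation of the identity: the original axes are kept

In other words, PCA merely relabels the existing axes in order of variance; no new combinations of features appear.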