# Solved – How does PCA behave when there is no correlation in the dataset

We all know that Principal Component Analysis is executed on a covariance/correlation matrix, but what if we have very high-dimensional data, say 75 features and 157849 rows?
How does PCA tackle this?

• Does it tackle this problem in the same way as it does for correlated datasets?
• Will my explained variance be equally distributed among the 75 features?
• I came across Bartlett's test and the KMO test, which respectively help in identifying:
  • whether there is any correlation present or not, and
  • the proportion of variance that might be common variance among the variables.

I can certainly leverage these two tests to make an informed decision, but I am still looking for an answer to:

• How does PCA behave when there is no correlation in the dataset?

I want to get an interpretation of this in a way that I could explain it to my non-technical brother.
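As a rough illustration of the Bartlett check mentioned above, here is a minimal sketch of Bartlett's test of sphericity written directly with NumPy and SciPy. The function name `bartlett_sphericity` and the synthetic data are assumptions made for this illustration, not something from my actual dataset, and the KMO measure is not shown.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test of sphericity for an (n_samples, n_features) array.

    Null hypothesis: the correlation matrix is the identity,
    i.e. the features are uncorrelated.
    """
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # slogdet avoids underflow of det(R) when there are many features
    _, logdet = np.linalg.slogdet(R)
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) / 2
    p_value = chi2.sf(statistic, dof)
    return statistic, dof, p_value

# Hypothetical data: independent features vs. features sharing one factor
rng = np.random.default_rng(0)
X_uncorr = rng.normal(size=(2000, 5))
shared = rng.normal(size=(2000, 1))
X_corr = X_uncorr + 2.0 * shared

stat_u, dof_u, p_u = bartlett_sphericity(X_uncorr)
stat_c, dof_c, p_c = bartlett_sphericity(X_corr)
print("uncorrelated:", stat_u, p_u)
print("correlated:  ", stat_c, p_c)
```

A very small p-value rejects the hypothesis of no correlation. With as many rows as I have, even tiny correlations can come out as significant, so the test is probably best read together with the size of the correlations themselves.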

Practical example using Python:

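```python
import numpy as np
import pandas as pd

s = pd.Series(data=[1, 1, 1], index=['a', 'b', 'c'])
diag_data = np.diag(s)

df = pd.DataFrame(diag_data, index=s.index, columns=s.index)
# Normalizing: center each column and scale by its sample standard deviation
df = (df.subtract(df.mean())).divide(df.std())
```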

Which looks like:

```
          a         b         c
a  1.154701 -0.577350 -0.577350
b -0.577350  1.154701 -0.577350
c -0.577350 -0.577350  1.154701
```

The correlation matrix (which equals the covariance matrix here, since the data is standardized) looks like this:

```
Cor = np.corrcoef(df.T)
Cor

array([[ 1. , -0.5, -0.5],
       [-0.5,  1. , -0.5],
       [-0.5, -0.5,  1. ]])
```

Now, calculating PCA Projections:

```python
eigen_vals, eigen_vects = np.linalg.eig(Cor)
projections = pd.DataFrame(np.dot(df, eigen_vects))
```

And projections are:

```
          0             1         2
0  1.414214 -2.012134e-17 -0.102484
1 -0.707107 -2.421659e-16 -1.170283
2 -0.707107 -1.989771e-16  1.272767
```

The explained variance ratio seems to be equally distributed between two components:

``[0.5000000000000001, -9.680089716721685e-17, 0.5000000000000001] ``
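For reference, a ratio like this can be obtained from the eigenvalues alone. The sketch below assumes the ratio is defined as each eigenvalue divided by the sum of all eigenvalues, and it uses `np.linalg.eigvalsh` (intended for symmetric matrices) instead of `np.linalg.eig`, so the ordering differs from the list above.

```python
import numpy as np

# Correlation matrix from the example above
Cor = np.array([[1.0, -0.5, -0.5],
                [-0.5, 1.0, -0.5],
                [-0.5, -0.5, 1.0]])

# eigvalsh returns real eigenvalues in ascending order for symmetric input;
# reverse so the largest eigenvalue comes first
eigen_vals = np.linalg.eigvalsh(Cor)[::-1]

# Assumed definition: eigenvalue / sum of eigenvalues
explained_ratio = eigen_vals / eigen_vals.sum()
print(explained_ratio)
```

The third ratio is zero up to rounding because the three standardized columns sum to zero in every row, so one direction carries no variance at all. In other words, this toy dataset has only two independent directions.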

Now, when I tried calculating the Q-Residual error in order to find the reconstruction error, I got zero for all the features:

```
a    0.0
b    0.0
c    0.0
dtype: float64
```

This would indicate that PCA on a non-correlated dataset, such as the identity matrix, gives us projections that reconstruct the original data points almost exactly. The same results are obtained with a diagonal matrix.
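One thing worth checking here is how the reconstruction error depends on the number of components kept. Because the eigenvectors form an orthogonal matrix, projecting onto all of them and mapping back returns the centered data exactly, whatever the correlation structure. The sketch below uses a hypothetical helper, `reconstruction_sse`, and random independent features (both are assumptions for illustration) to compare keeping all components with keeping only some of them.

```python
import numpy as np

def reconstruction_sse(X, k):
    """Per-feature sum of squared reconstruction errors using the top k components."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]           # indices of largest eigenvalues first
    V = vecs[:, order[:k]]                   # keep the top k eigenvectors
    X_hat = Xc @ V @ V.T                     # project, then map back
    return ((Xc - X_hat) ** 2).sum(axis=0)

# Hypothetical data: four independent features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))

full = reconstruction_sse(X, 4)      # all components kept
partial = reconstruction_sse(X, 2)   # half of the components dropped
print(full)
print(partial)
```

With all components the error is zero up to rounding, which matches what I observed. With fewer components on uncorrelated data the error is clearly non-zero, since every dropped component carries a comparable share of the variance.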

If the reconstruction error is very low, this would suggest that we can always include the PCA step in a single pipeline: if the dataset does not carry much correlation, we get essentially the same data back after the PCA transformation, while for a dataset with highly correlated features we avoid the curse of dimensionality.
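To make the contrast concrete, here is a small sketch comparing how many components are needed to reach 95% of the variance on uncorrelated versus strongly correlated data. The helper `pca_summary`, the 95% target, and the synthetic data (20 features, with the correlated set driven by two hidden factors) are all assumptions chosen for illustration.

```python
import numpy as np

def pca_summary(X, target=0.95):
    """Return (explained variance ratios, number of components reaching target)."""
    R = np.corrcoef(X, rowvar=False)
    vals = np.linalg.eigvalsh(R)[::-1]       # largest eigenvalue first
    ratios = vals / vals.sum()
    cumulative = np.cumsum(ratios)
    # first position where the cumulative ratio reaches the target, as a count
    k = int(np.searchsorted(cumulative, target) + 1)
    return ratios, k

rng = np.random.default_rng(0)
n, p = 5000, 20

# Uncorrelated: every feature is independent noise
X_uncorr = rng.normal(size=(n, p))

# Correlated: two hidden factors, each copied into ten features, plus small noise
latent = rng.normal(size=(n, 2))
X_corr = np.repeat(latent, p // 2, axis=1) + 0.1 * rng.normal(size=(n, p))

ratios_u, k_u = pca_summary(X_uncorr)
ratios_c, k_c = pca_summary(X_corr)
print("uncorrelated: components for 95% =", k_u)
print("correlated:   components for 95% =", k_c)
```

On the uncorrelated data every component explains roughly 1/20 of the variance, so almost all of them are needed. On the correlated data two components are enough. This is the sense in which PCA only reduces dimensionality when there is correlation to exploit.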

What are your views on this?
