Python – sklearn PLSRegression: why is T != X0 * W (scores, scaled data, weights, respectively)?

Pretty much a complete newbie with PLS, Python, stats (and Stack Exchange), sorry:

When using sklearn for PLSRegression, why is the resulting scores matrix not given by the product of (scaled) input and the weights matrix?

I.e. T = X0 * W

A minimal working example showing this is given below.

I have figured out in the meantime that the scores can be calculated according to an algorithm shown, e.g., here http://www.sciencedirect.com/science/article/pii/0003267086800289?via%3Dihub and that they can also be obtained via the `.transform` method. But I struggle to understand why this design was chosen. Can anyone tell me what the benefit of having it this way is?

Thanks a lot!

Cheers

```python
import numpy as np

# PLS tools
from sklearn.preprocessing import scale
from sklearn.cross_decomposition import PLSRegression

# just some numbers
X = np.random.multivariate_normal(np.array([3, 4, 5]), np.diag([5, 4, 1]), 100)
y = np.dot(X, np.array([1, 2, 3])) + np.random.random(size=(100,))

pls = PLSRegression(n_components=2)
pls.fit(scale(X), y)

(pls.x_scores_ - np.dot(scale(X), pls.x_weights_)) / pls.x_scores_
# columns differ significantly from the second component onward
```

You are referring to the NIPALS algorithm. In that algorithm, as the paper you linked shows, the $X$ block is deflated after each component while the $Y$ block is built up.

So there is no single $W$ matrix that can be applied to $X$ directly; instead, the scores are calculated step by step:

start with

$E = X$

for the first component (or latent variable, LV)

$t_1 = E w_1$

$E = E - t_1 p_1'$

for the second component

$t_2 = Ew_2$

$E = E - t_2 p_2'$

and so on…

where $t_h$ is the $h^{th}$ scores vector, $w_h$ the $h^{th}$ weights vector, and $p_h$ the $h^{th}$ loadings vector of $X$.

There is, however, another algorithm, called SIMPLS, which provides exactly what you are looking for: a single weights matrix that is applied directly to $X$. For that reason I personally find NIPALS confusing and consider SIMPLS superior.

TL;DR The reason is the deflation step in the NIPALS algorithm.
