I have a problem with computing the ridge regression estimator with R.
In order to calculate the regression estimator of a data set, I created three samples of size 10.
    V = c(4, 3, 10, 1, 7, 10, 2, 6, 1, 4)
    W = c(16, 11, 16, 13, 12, 20, 11, 20, 16, 20)
    Y = c(26, 28, 27, 29, 24, 26, 22, 23, 28, 23)
Now, one can easily determine the OLS estimators with the command:
    lm(Y ~ V + W)
which gives:
    Coefficients:
    (Intercept)            V            W
       27.70294     -0.06096     -0.11679
However, I also calculated the estimators manually with the OLS formula
$$(X'X)^{-1}X'Y $$
    X <- cbind(rep(1, 10), V, W)
    solve(t(X) %*% X) %*% t(X) %*% Y
yielding the same results.
I wanted to repeat this with the ridge regression.
    library(MASS)
    lm.ridge(Y ~ V + W, lambda = 0.1)
which yields:

                          V           W
    27.68511898 -0.06088623 -0.11566871
These results are quite similar to the OLS results, which makes sense, considering that λ is only 0.1.
Since the ridge estimator is $$(X'X + \lambda I)^{-1}X'Y,$$ I only changed the expression above to:
    solve(t(X) %*% X + 0.1 * diag(3)) %*% t(X) %*% Y
This, however, yields quite different results:

      22.86576934
    V -0.09884927
    W  0.19226126
Does anyone know what I am doing wrong?
Thank you very much for your quick reply!
I was aware that my text was quite difficult to read, but didn't know how to make it better. The 4 leading spaces advice is quite valuable, indeed.
What you say about removing the penalty from the intercept makes perfect sense to me. So λ is not to be multiplied by the identity matrix, but by a slightly altered identity matrix, whose first row and column consist of zeros only.
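For concreteness, a minimal sketch of such a penalty matrix for the two predictors here, equivalent to the cbind(0, rbind(0, diag(2))) construction I use further down:

    # identity matrix with the intercept position zeroed out,
    # so that the intercept is left unpenalized (illustration only)
    P <- diag(c(0, 1, 1))   # same matrix as cbind(0, rbind(0, diag(2)))
    solve(t(X) %*% X + 0.1 * P) %*% t(X) %*% Y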
However, I still don't quite understand why λ should be squared. I thought the ridge regression objective is:
$$(Y-X\beta)'(Y-X\beta) + \lambda\beta'\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta + \lambda\beta'\beta$$
Taking the first derivative and setting it equal to zero yields:
$$-2X'Y + 2X'X\beta + 2\lambda\beta = 0 \iff X'X\beta + \lambda\beta = X'Y$$
$$\iff (X'X + \lambda I)\beta = X'Y \iff \hat{\beta} = (X'X + \lambda I)^{-1}X'Y$$
So I thought that in ridge regression the β are squared, but λ is not, in contrast to the lasso, where we take the absolute values of the β-vector.
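As a quick sanity check (still penalizing the intercept, as in my original attempt), minimizing the penalized sum of squares numerically with optim should reproduce the closed-form solution with an unsquared λ; here is a rough sketch:

    # minimize the ridge objective numerically and compare with the closed form;
    # "ridge_loss" is just an illustrative name
    ridge_loss <- function(b, X, Y, lambda) {
      sum((Y - X %*% b)^2) + lambda * sum(b^2)
    }
    b_opt    <- optim(rep(0, 3), ridge_loss, X = X, Y = Y, lambda = 0.1,
                      method = "BFGS")$par
    b_closed <- solve(t(X) %*% X + 0.1 * diag(3)) %*% t(X) %*% Y
    cbind(b_opt, b_closed)   # the two columns should agree, with lambda unsquared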
Applying both your squared and unsquared code, however, gives in both cases slightly different results than the built-in lm.ridge function:
    lm.ridge(Y ~ V + W, lambda = 0.1)
                          V           W
    27.68511898 -0.06088623 -0.11566871

    lambda <- cbind(0, rbind(0, diag(2)))
    solve(t(X) %*% X + 0.1 * lambda) %*% t(X) %*% Y
             [,1]
      27.70146491
    V -0.06094557
    W -0.11670492

    solve(t(X) %*% X + 0.1^2 * lambda) %*% t(X) %*% Y
             [,1]
      27.70279419
    V -0.06096134
    W -0.11678579
So, altogether, I am still a little bit confused.
Best Answer
You need to standardize $X$ before applying the penalty $\lambda$, and then transform the coefficients back to the scale of the original $X$. The results will then match lm.ridge.
Something like:
    r.01 <- crossprod(Xs) / (nrow(X) - 1) + diag(ncol(X)) * lambda
    as.numeric(tcrossprod(chol2inv(chol(r.01)), Xs / (nrow(X) - 1)) %*% y) / sd_X
where X is the original model matrix excluding the intercept, Xs is X standardized to have unit variance, and sd_X is the vector of standard deviations of the variables in X.
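To make that concrete for the data in the question, something along these lines should work (a rough sketch; the names X0, Xsc, b_std are arbitrary, and the scaling follows how MASS::lm.ridge appears to scale, by the root mean square deviation of each centered column with divisor n rather than n - 1):

    library(MASS)

    X0 <- cbind(V, W)                     # predictors only, no intercept column

    Xc  <- scale(X0, center = TRUE, scale = FALSE)  # center the predictors
    Xsc <- sqrt(colMeans(Xc^2))                     # column scales (divisor n, as lm.ridge seems to use)
    Xs  <- sweep(Xc, 2, Xsc, "/")                   # standardized predictors

    # ridge solution on the standardized scale; the intercept is left out of the penalty
    b_std <- solve(t(Xs) %*% Xs + 0.1 * diag(2)) %*% t(Xs) %*% (Y - mean(Y))

    b <- b_std / Xsc                      # slopes transformed back to the original scale
    a <- mean(Y) - colMeans(X0) %*% b     # intercept recovered at the end

    c(a, b)   # should agree with coef(lm.ridge(Y ~ V + W, lambda = 0.1))

If the numbers line up, the small discrepancies seen in the comparison above come from the scaling step, not from any squaring of λ.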