Solved – Ridge Regression with R

I have a problem with computing the ridge regression estimator with R.

To calculate the regression estimators for a data set, I created three samples of size 10.

    V = c(4, 3, 10, 1, 7, 10, 2, 6, 1, 4)
    W = c(16, 11, 16, 13, 12, 20, 11, 20, 16, 20)
    Y = c(26, 28, 27, 29, 24, 26, 22, 23, 28, 23)

Now, one can easily determine the OLS estimators with the command:

lm(Y~V+W) 

which gives:

    Coefficients:
    (Intercept)            V            W  
       27.70294     -0.06096     -0.11679  

However, I also calculated the estimators manually with the OLS formula
$$(X'X)^{-1}X'Y $$

    X <- cbind(rep(1,10),V,W)
    solve(t(X)%*%X)%*%t(X)%*%Y

yielding the same results.


I wanted to repeat this with the ridge regression.

    library(MASS)
    lm.ridge(Y~V+W,lambda=0.1)

which yields:

                              V           W 
    27.68511898 -0.06088623 -0.11566871 

These results are quite similar to the OLS results, which makes sense, considering that λ is only 0.1.
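A quick sanity check of this is to set λ to zero, in which case the ridge fit should collapse to the OLS fit:

    lm.ridge(Y~V+W, lambda=0)   # with no penalty, this should match lm(Y~V+W)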

Since the ridge estimator is $$(X'X + λI)^{-1}X'Y,$$ I only changed the above expression to:

solve(t(X)%*%X+0.1*diag(3))%*%t(X)%*%Y 

This, however, yields quite different results:

             [,1]
      22.86576934
    V -0.09884927
    W  0.19226126


Does anyone know what I am doing wrong?


Thank you very much for your quick reply!

I was aware that my text was quite difficult to read, but didn't know how to make it better. The 4 leading spaces advice is quite valuable, indeed.

What you say about removing the penalty from the intercept makes perfect sense to me. So λ is not to be multiplied by the identity matrix, but by a slightly altered identity matrix whose first row consists of zeros only.

However, I still don't quite understand why λ should be squared. I thought the original ridge regression objective is
$$(Y-X\beta)'(Y-X\beta) + \lambda\beta'\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta + \lambda\beta'\beta.$$
Taking the first derivative and setting it to zero yields
$$-2X'Y + 2X'X\beta + 2\lambda\beta = 0 \iff X'X\beta + \lambda\beta = X'Y$$
$$\iff (X'X + \lambda I)\beta = X'Y \iff \hat{\beta} = (X'X + \lambda I)^{-1}X'Y.$$
So I thought that in ridge regression the β are squared, but the λ is not, in contrast to the lasso, where we take the absolute values of the entries of β.
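To convince myself of this closed form, I can check it numerically: minimising the penalised RSS with optim should reproduce the manual formula (a small sketch reusing V, W, Y and the 3-column X from above; note that this version penalises the intercept as well, so it matches my manual computation, not lm.ridge):

    lam <- 0.1

    # closed-form solution from the derivation above (intercept is penalised too)
    b_closed <- solve(t(X) %*% X + lam * diag(3)) %*% t(X) %*% Y

    # minimise the penalised residual sum of squares directly
    pen_rss <- function(b) sum((Y - X %*% b)^2) + lam * sum(b^2)
    b_num <- optim(c(0, 0, 0), pen_rss, method = "BFGS")$par

    cbind(b_closed, b_num)   # the two columns should agree closely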

Applying both your squared code and your unsquared code gives, however, in both cases slightly different results than the built-in lm.ridge function:

    lm.ridge(Y~V+W,lambda=0.1)
                              V           W 
    27.68511898 -0.06088623 -0.11566871 

    lambda <- cbind(0, rbind(0, diag(2)))
    solve(t(X) %*% X + 0.1*lambda) %*% t(X) %*% Y
             [,1]
      27.70146491
    V -0.06094557
    W -0.11670492

    solve(t(X) %*% X + 0.1^2*lambda) %*% t(X) %*% Y
             [,1]
      27.70279419
    V -0.06096134
    W -0.11678579

So, altogether, I am still a little bit confused.

You need to standardize $X$ before applying the penalty $\lambda$, then transform the coefficients back to the scale of the original $X$. The results will then be the same as with lm.ridge.

Something like:

    r.01 <- crossprod(Xs) / (nrow(X) - 1) + diag(ncol(X)) * lambda
    as.numeric(tcrossprod(chol2inv(chol(r.01)), Xs / (nrow(X) - 1)) %*% y) / sd_X

where X is the original model matrix excluding the intercept, Xs is X standardized to have unit variance, and sd_X is the vector of standard deviations of the variables in X.
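For the data in the question, one way to build these pieces is sketched below (X is redefined here without the intercept column; the name beta and the final intercept line are only for illustration and are not part of the snippet above):

    lambda <- 0.1
    X  <- cbind(V, W)            # model matrix without the intercept column
    y  <- Y
    Xs <- scale(X)               # centred columns with unit variance
    sd_X <- apply(X, 2, sd)

    r.01 <- crossprod(Xs) / (nrow(X) - 1) + diag(ncol(X)) * lambda
    beta <- as.numeric(tcrossprod(chol2inv(chol(r.01)), Xs / (nrow(X) - 1)) %*% y) / sd_X

    # intercept recovered from the means, for comparison with the built-in function
    c(mean(y) - sum(beta * colMeans(X)), beta)
    lm.ridge(Y ~ V + W, lambda = 0.1)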
