# Solved – Ridge Regression with R

I have a problem with computing the ridge regression estimator with R.

In order to calculate the regression estimator of a data set, I created three samples of size 10.

``V = c(4, 3, 10, 1, 7, 10, 2, 6, 1, 4) W = c(16, 11, 16, 13, 12, 20, 11, 20, 16, 20) Y = c(26, 28, 27, 29, 24, 26, 22, 23, 28, 23) ``

Now, one can easily determine the OLS estimators with the command:

``lm(Y~V+W) ``

which gives:

``Coefficients: (Intercept)      V            W   27.70294     -0.06096     -0.11679   ``

However, I also calculated the estimators manually with the OLS – formula
$$(X'X)^{-1}X'Y$$

``X <- cbind(rep(1,10),V,W)  solve(t(X)%*%X)%*%t(X)%*%Y ``

yielding the same results.

I wanted to repeat this with the ridge regression.

``library(MASS)  lm.ridge(Y~V+W,lambda=0.1) ``

which yiels:

``                  V           W  27.68511898 -0.06088623 -0.11566871  ``

These results are quite similiar to the OLS results, which makes sense, considering that λ is only 0.1.

Since the ridge estimator is $$(X'X + λI)^{-1}X'Y$$,

, I only the changed the upper expression to:

``solve(t(X)%*%X+0.1*diag(3))%*%t(X)%*%Y ``

This yiels, however, quite different results:

``  22.86576934 V -0.09884927 W  0.19226126 ``

Does anyone know, what I am doing wrong?

I was aware that my text was quite difficult to read, but didn't know how to make it better. The 4 leading spaces advice is quite valuable, indeed.

What you say about removing the penalty from the intercept makes perfect sense to me. So λ is not to be multiplied by the identity matrix, but by the slightly altered identity matrix, whose first row consist of zeros only.

However, I still quite don't understand why the λ should be squared. I thought the original ridge-regression-equation is:
$$(Y-Xβ)'(Y-Xβ) + λβ'β = Y'Y-2β'X'Y+β'X'Xβ+λβ'β$$
Taking the first derivate and equating with 0 yields:
$$-2X'Y+2X'Xβ+2λβ = 0 <=> X'Xβ+λβ=X'Y$$
$$<=> (X'X+λI)β=X'Y <=> {hat beta}= (X'X+λI)X'Y$$
So I thought, in ridge regression the β are squared, but the λ are not, in contrast to Lasso, where we take the modulus of the β-vector.

Applying both your squared code and the not squared code, gives, however, in both cases slightly different result than using the inbuilt lm.ridge -function:

``lm.ridge(Y~V+W,lambda=0.1)                  V           W  27.68511898 -0.06088623 -0.11566871  lambda <-  cbind(0, rbind(0, diag(2))) solve(t(X) %*% X + 0.1*lambda) %*% t(X) %*% Y [,1]   27.70146491 V -0.06094557 W -0.11670492  solve(t(X) %*% X + 0.1^2*lambda) %*% t(X) %*% Y [,1]   27.70279419 V -0.06096134 W -0.11678579 ``

So, altogether, I am still a little bit confused.

Contents

You need to standardize $$X$$ before applying the penalty, $$lambda$$, then transform the coefficients back to the scale of the original $$X$$. And the results will be the same with `lm.ridge`.
``r.01 <- crossprod(Xs) / (nrow(X) - 1) + diag(ncol(X)) * lambda as.numeric(tcrossprod(chol2inv(chol(r.01)), Xs / (nrow(X) - 1)) %*% y) / sd_X ``
where `X` is the original model matrix excluding the intercept. `Xs` is `X` standardized to have unit variance and `sd_X` is vector of standard deviations of variables in `X`.