I have a problem with computing the ridge regression estimator with R.

In order to calculate the regression estimator of a data set, I created three samples of size 10.

`V = c(4, 3, 10, 1, 7, 10, 2, 6, 1, 4) W = c(16, 11, 16, 13, 12, 20, 11, 20, 16, 20) Y = c(26, 28, 27, 29, 24, 26, 22, 23, 28, 23) `

Now, one can easily determine the OLS estimators with the command:

`lm(Y~V+W) `

which gives:

`Coefficients: (Intercept) V W 27.70294 -0.06096 -0.11679 `

However, I also calculated the estimators manually with the OLS – formula

$$(X'X)^{-1}X'Y $$

`X <- cbind(rep(1,10),V,W) solve(t(X)%*%X)%*%t(X)%*%Y `

yielding the same results.

I wanted to repeat this with the ridge regression.

`library(MASS) lm.ridge(Y~V+W,lambda=0.1) `

which yiels:

` V W 27.68511898 -0.06088623 -0.11566871 `

These results are quite similiar to the OLS results, which makes sense, considering that λ is only 0.1.

Since the ridge estimator is $$(X'X + λI)^{-1}X'Y$$,

, I only the changed the upper expression to:

`solve(t(X)%*%X+0.1*diag(3))%*%t(X)%*%Y `

This yiels, however, quite different results:

` 22.86576934 V -0.09884927 W 0.19226126 `

Does anyone know, what I am doing wrong?

Thank you very much for your quick reply!

I was aware that my text was quite difficult to read, but didn't know how to make it better. The 4 leading spaces advice is quite valuable, indeed.

What you say about removing the penalty from the intercept makes perfect sense to me. So λ is not to be multiplied by the identity matrix, but by the slightly altered identity matrix, whose first row consist of zeros only.

However, I still quite don't understand why the λ should be squared. I thought the original ridge-regression-equation is:

$$(Y-Xβ)'(Y-Xβ) + λβ'β = Y'Y-2β'X'Y+β'X'Xβ+λβ'β$$

Taking the first derivate and equating with 0 yields:

$$-2X'Y+2X'Xβ+2λβ = 0 <=> X'Xβ+λβ=X'Y$$

$$ <=> (X'X+λI)β=X'Y <=> {hat beta}= (X'X+λI)X'Y$$

So I thought, in ridge regression the β are squared, but the λ are not, in contrast to Lasso, where we take the modulus of the β-vector.

Applying both your squared code and the not squared code, gives, however, in both cases slightly different result than using the inbuilt lm.ridge -function:

`lm.ridge(Y~V+W,lambda=0.1) V W 27.68511898 -0.06088623 -0.11566871 lambda <- cbind(0, rbind(0, diag(2))) solve(t(X) %*% X + 0.1*lambda) %*% t(X) %*% Y [,1] 27.70146491 V -0.06094557 W -0.11670492 solve(t(X) %*% X + 0.1^2*lambda) %*% t(X) %*% Y [,1] 27.70279419 V -0.06096134 W -0.11678579 `

So, altogether, I am still a little bit confused.

**Contents**hide

#### Best Answer

You need to standardize $X$ before applying the penalty, $lambda$, then transform the coefficients back to the scale of the original $X$. And the results will be the same with `lm.ridge`

.

Something like:

`r.01 <- crossprod(Xs) / (nrow(X) - 1) + diag(ncol(X)) * lambda as.numeric(tcrossprod(chol2inv(chol(r.01)), Xs / (nrow(X) - 1)) %*% y) / sd_X `

where `X`

is the original model matrix excluding the intercept. `Xs`

is `X`

standardized to have unit variance and `sd_X`

is vector of standard deviations of variables in `X`

.

### Similar Posts:

- Solved – Could applying ridge regression on small dataset improve predictive power
- Solved – Naive Ridge Regression in R
- Solved – Ridge regression results different in using lm.ridge and glmnet
- Solved – Ridge Regression: how to show squared bias increases as $lambda$ increases
- Solved – Maximum penalty for ridge regression