I have a Ridge regression model to estimate the coefficients of the true model $y = X\beta + \epsilon$. I have the standard model where $\mathbb{E}[\epsilon] = 0, \mathrm{Var}(\epsilon) = I.$ The ridge estimator of $\beta$ is: $\beta^{\mathrm{Ridge}} = (X^\top X + \lambda I )^{-1} X^\top y$

Assume we have a **fixed** testing point $x_0$. I have proved that by increasing $\lambda$ the variance of estimation $$\hat{f}(x_0) = x_0^\top (X^\top X + \lambda I)^{-1} X^\top y$$

is decreasing.

Now I want to show that by increasing $\lambda$ the squared bias of the test estimation steadily increases.

I thought of using the bias-variance tradeoff, but it does not work since the tradeoff tells us

$$\mathrm{Error}(x_0) = \text{Irreducible Error} + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Variance}(\hat{f}(x_0)). $$

To conclude from the decreasing variance that the squared bias increases, we would need $\mathrm{Error}(x_0)$ to stay the same, but this is not the case.

So, how can I show that the bias of our ridge estimation on the test data **steadily increases** with increasing $\lambda$?


#### Best Answer

I do not know if you are still interested in this issue. I think it will be useful for your problem to look at the limiting behavior of the estimator's mean squared error (as the penalty parameter approaches infinity).

We can indicate with $\hat{\beta}_{r} = (X^\top X + \lambda I )^{-1} X^\top y$ the ridge estimator and with $\hat{\beta} = (X^\top X)^{-1} X^\top y$ the OLS estimator (which is unbiased, hence $E(\hat{\beta}) = \beta$). Now, if we define $K = (X^\top X + \lambda I )^{-1} X^\top X$ we can verify that $\hat{\beta}_{r} = K \hat{\beta}$ (so $K$ *transforms* the OLS estimator into the ridge one).
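The identity $\hat{\beta}_{r} = K \hat{\beta}$ can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the coefficient vector, the penalty value and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical simulated data: sizes, coefficients and penalty are illustrative only.
rng = np.random.default_rng(0)
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

XtX = X.T @ X
I = np.eye(p)

# OLS and ridge estimators computed directly from their definitions.
beta_ols = np.linalg.solve(XtX, X.T @ y)
beta_ridge = np.linalg.solve(XtX + lam * I, X.T @ y)

# K = (X'X + lam I)^{-1} X'X maps the OLS estimate to the ridge estimate.
K = np.linalg.solve(XtX + lam * I, XtX)
ridge_from_ols = K @ beta_ols

print(np.allclose(beta_ridge, ridge_from_ols))
```

The check relies on $X^\top X$ being invertible, which is also what the OLS estimator needs in order to exist.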

Then, keeping in mind the definition of $K$, it can be demonstrated that (see e.g. Hoerl and Kennard, 1970):

$$ \begin{array}{lll} MSE(\hat{\beta}_{r}) &= E[(\hat{\beta}_{r} - \beta)^\top (\hat{\beta}_{r} - \beta)] = \mbox{Var}(\hat{\beta}_{r}) + [\mbox{Bias}(\hat{\beta}_{r})]^2 \\ & = \sigma^{2}\mbox{tr}\{K (X^{\top} X)^{-1}K^{\top}\} + \beta^{\top}(K - I)^{\top}(K - I)\beta \\ \mbox{Var}(\hat{\beta}_{r}) &= \sigma^{2}\mbox{tr}\{K (X^{\top} X)^{-1}K^{\top}\} \\ [\mbox{Bias}(\hat{\beta}_{r})]^2 &= \beta^{\top}(K - I)^{\top}(K - I)\beta. \end{array} $$
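As a worked step, the squared-bias term can be written out explicitly in $\lambda$. The sketch below assumes $X^\top X$ is invertible with eigendecomposition $X^\top X = V D V^\top$, eigenvalues $d_i > 0$, and introduces $b = V^\top \beta$ (notation not used in the original). It concerns the total squared bias of the coefficient estimator, not the bias at a single point $x_0$.

$$ \begin{array}{lll} K - I &= (X^\top X + \lambda I)^{-1}\left[X^\top X - (X^\top X + \lambda I)\right] = -\lambda (X^\top X + \lambda I)^{-1} \\ \beta^{\top}(K - I)^{\top}(K - I)\beta &= \lambda^{2}\, \beta^{\top} (X^\top X + \lambda I)^{-2} \beta = \sum_{i} b_i^{2} \left( \frac{\lambda}{d_i + \lambda} \right)^{2} \\ \frac{d}{d\lambda} \left( \frac{\lambda}{d_i + \lambda} \right) &= \frac{d_i}{(d_i + \lambda)^{2}} > 0. \end{array} $$

Each factor $\lambda/(d_i + \lambda)$ is non-negative and increasing in $\lambda \geq 0$, so every term of the sum is non-decreasing and the total squared bias increases with $\lambda$ whenever $\beta \neq 0$, approaching $\sum_i b_i^2 = \beta^\top \beta$.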

From above we can compute $$ \lim_{\lambda \rightarrow \infty} MSE(\hat{\beta}_{r}) = \beta^\top \beta $$
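The variance and squared-bias formulas, and this limit, can be illustrated numerically. This is only a sketch under assumptions: the data are simulated, $\sigma^2 = 1$ as in the question, and the helper `ridge_terms` and the penalty grid are hypothetical names and values introduced here.

```python
import numpy as np

# Hypothetical simulated design and coefficients, for illustration only.
rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
sigma2 = 1.0  # matches Var(epsilon) = I in the question

XtX = X.T @ X
I = np.eye(p)

def ridge_terms(lam):
    """Return (variance, squared bias) of the ridge estimator for penalty lam."""
    K = np.linalg.solve(XtX + lam * I, XtX)
    variance = sigma2 * np.trace(K @ np.linalg.inv(XtX) @ K.T)
    bias_vec = (K - I) @ beta
    return variance, bias_vec @ bias_vec

# Evaluate both terms on a wide, increasing grid of penalties.
lams = np.logspace(-3, 8, 60)
variances, sq_biases = np.array([ridge_terms(lam) for lam in lams]).T
mse = variances + sq_biases

print("squared bias increasing:", bool(np.all(np.diff(sq_biases) > 0)))
print("variance decreasing:   ", bool(np.all(np.diff(variances) < 0)))
print("MSE at largest lambda:  ", mse[-1], "vs beta'beta:", beta @ beta)
```

On this grid the squared bias grows and the variance shrinks as the penalty increases, and the MSE at the largest penalty is close to $\beta^\top \beta$.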

which is the squared bias of an **estimator equal to zero** (since the variance, as you pointed out, goes to zero for limiting $lambda$). I hope this helps a bit (also I hope the notation is correct and clear enough).