I understand that we can employ regularization in a least squares regression problem as

$$boldsymbol{w}^* = operatorname*{argmin}_w left[ (mathbf y-mathbf{Xw})^T(boldsymbol{y}-mathbf{Xw}) + lambda|boldsymbol{w}|^2 right]$$

and that this problem has a closed-form solution as:

$$hat{boldsymbol{w}} = (boldsymbol{X}^Tboldsymbol{X}+lambdaboldsymbol{I})^{-1}boldsymbol{X}^Tboldsymbol{y}.$$

We see that in the 2nd equation, regularization is simply adding $lambda$ to the diagonal of $boldsymbol{X}^Tboldsymbol{X}$, which is done to improve the numerical stability of matrix inversion.

My current 'crude' understanding of numerical stability is that if a function becomes more 'numerically stable' then its output will be less significantly affected by the noise in its inputs. **I am having difficulties relating this concept of improved numerical stability to the bigger picture of how it avoids/reduces the problem of overfitting.**

I have tried looking at Wikipedia and a few other university websites, but they don't go deep into explaining why this is so.

**Contents**hide

#### Best Answer

In the linear model $Y=Xbeta + epsilon$, assuming uncorrelated errors with mean zero and $X$ having full column rank, the least squares estimator $(X^TX)^{-1}X^TY$ is an unbiased estimator for the parameter $beta$. However, this estimator can have high variance. For example, when two of the columns of $X$ are highly correlated.

The penalty parameter $lambda$ makes $hat{w}$ a biased estimator of $beta$, but it decreases its variance. Also, $hat{w}$ is the posterior expectation of $beta$ in a Bayesian regression with a $N(0,frac{1}{lambda}I)$ prior on $beta$. In that sense, we include some information into the analysis that says the components of $beta$ ought not be too far from zero. Again, this leads us to a biased point estimate of $beta$ but reduces the variance of the estimate.

In a setting where $X$ high dimensional, say $N approx p$, the least squares fit will match the data almost perfectly. Although unbiased, this estimate will be highly sensitive to fluctuations in the data because in such high dimensions, there will be many points with high leverage. In such situations the sign of some components of $hat{beta}$ can determined by a single observation. The penalty term has the effect of shrinking these estimates towards zero, which can reduce the MSE of the estimator by reducing the variance.

Edit: In my initial response I provided a link to a relevant paper and in my haste I removed it. Here it is: http://www.jarad.me/stat615/papers/Ridge_Regression_in_Practice.pdf