# The limit of the “unit-variance” ridge regression estimator when $\lambda \to \infty$

Consider ridge regression with an additional constraint requiring that $\hat{\mathbf y}$ have unit sum of squares (equivalently, unit variance); if needed, one can assume that $\mathbf y$ has unit sum of squares as well:

$$\hat{\boldsymbol\beta}_\lambda^* = \arg\min\Big\{\|\mathbf y - \mathbf X \boldsymbol\beta\|^2+\lambda\|\boldsymbol\beta\|^2\Big\} \quad \text{s.t.} \quad \|\mathbf X \boldsymbol\beta\|^2=1.$$

What is the limit of $\hat{\boldsymbol\beta}_\lambda^*$ when $\lambda\to\infty$?

Here are some statements that I believe are true:

1. When $\lambda=0$, there is a neat explicit solution: take the OLS estimator $\hat{\boldsymbol\beta}_0=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$ and normalize it to satisfy the constraint (one can see this by adding a Lagrange multiplier and differentiating):
$$\hat{\boldsymbol\beta}_0^* = \hat{\boldsymbol\beta}_0 \big/ \|\mathbf X\hat{\boldsymbol\beta}_0\|.$$

2. In general, the solution is
$$\hat{\boldsymbol\beta}_\lambda^*=\big((1+\mu)\mathbf X^\top \mathbf X + \lambda \mathbf I\big)^{-1}\mathbf X^\top \mathbf y \quad \text{with $\mu$ needed to satisfy the constraint}.$$
I don't see a closed-form solution when $\lambda > 0$. It seems that the solution is equivalent to the usual RR estimator with some $\lambda^*$, normalized to satisfy the constraint, but I don't see a closed formula for $\lambda^*$.

3. When $\lambda\to \infty$, the usual RR estimator
$$\hat{\boldsymbol\beta}_\lambda=(\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1}\mathbf X^\top \mathbf y$$
obviously converges to zero, but its direction $\hat{\boldsymbol\beta}_\lambda \big/ \|\hat{\boldsymbol\beta}_\lambda\|$ converges to the direction of $\mathbf X^\top \mathbf y$, a.k.a. the first partial least squares (PLS) component.

Statements (2) and (3) together make me think that perhaps $\hat{\boldsymbol\beta}_\lambda^*$ also converges to the appropriately normalized $\mathbf X^\top \mathbf y$, but I am not sure if this is correct, and I have not managed to convince myself either way.
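Statement (1) is easy to check numerically. The sketch below (a hypothetical two-predictor design with made-up random data) compares the normalized OLS estimator against a brute-force grid search over the constraint set at $\lambda=0$; the constraint set is parametrized through the SVD of $\mathbf X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Normalized OLS: the claimed closed-form solution at lambda = 0.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_star0 = beta_ols / np.linalg.norm(X @ beta_ols)

# Brute force: parametrize {beta : ||X beta||^2 = 1} via the SVD.
# With X = U S V^T and z = S V^T beta a unit vector, beta = V S^{-1} z.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta = np.linspace(0, 2 * np.pi, 100_000, endpoint=False)
Z = np.stack([np.cos(theta), np.sin(theta)])     # unit vectors z
B = (Vt.T / s) @ Z                               # all candidate betas
rss = ((y[:, None] - X @ B) ** 2).sum(axis=0)    # lambda = 0 objective
beta_grid = B[:, rss.argmin()]

print(np.allclose(beta_star0, beta_grid, atol=1e-3))  # True
```

The grid minimizer agrees with the normalized OLS estimator up to the grid resolution, as statement (1) predicts.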


# A geometrical interpretation

The estimator described in the question is the Lagrange multiplier equivalent of the following optimization problem:

$$\text{minimize } f(\beta) \text{ subject to } g(\beta) \leq t \text{ and } h(\beta) = 1$$

$$\begin{aligned} f(\beta) &= \lVert y-X\beta \rVert^2 \\ g(\beta) &= \lVert \beta \rVert^2 \\ h(\beta) &= \lVert X\beta \rVert^2 \end{aligned}$$

which can be viewed, geometrically, as finding the smallest ellipsoid $f(\beta)=\text{RSS}$ that touches the intersection of the sphere $g(\beta) = t$ and the ellipsoid $h(\beta)=1$.

## Comparison to the standard ridge regression view

In terms of the geometrical view, this changes the old picture (for standard ridge regression), in which we look for the point where a spheroid (the errors) and a sphere ($\|\beta\|^2=t$) touch, into a new view in which we look for the point where the spheroid (errors) touches a curve (the norm of $\beta$ constrained by $\|X\beta\|^2=1$). The one sphere (blue in the left image) changes into a lower-dimensional figure due to the intersection with the $\|X\beta\|^2=1$ constraint.

In the two-dimensional case this is simple to view. When we tune the parameter $t$, we change the relative size of the blue/red spheres, i.e. the relative sizes of $f(\beta)$ and $g(\beta)$. (In the theory of Lagrange multipliers there is probably a neat way to show formally and exactly that $t$ as a function of $\lambda$, or the reverse, is a monotonic function. But I imagine you can see intuitively that the sum of squared residuals can only increase when we decrease $\|\beta\|$.)

The solution $\beta_\lambda$ for $\lambda=0$ lies, as you argued, on the line between $0$ and $\beta_{LS}$.

The solution $\beta_\lambda$ for $\lambda \to \infty$ is (indeed as you commented) in the direction of the loadings of the first principal component. This is the point where $\lVert \beta \rVert^2$ is smallest subject to $\lVert X\beta \rVert^2 = 1$. It is the point where the circle $\lVert \beta \rVert^2=t$ touches the ellipse $\lVert X\beta \rVert^2=1$ in a single point.

In this 2-d view the edges of the intersection of the sphere $\lVert \beta \rVert^2 = t$ and the spheroid $\lVert X\beta \rVert^2 = 1$ are points. In more dimensions these will be curves.

(I imagined at first that these curves would be ellipses, but they are more complicated. You could imagine the ellipsoid $\lVert X \beta \rVert^2 = 1$ being intersected by the ball $\lVert \beta \rVert^2 \leq t$ as some sort of ellipsoid frustum, but with edges that are not simple ellipses.)
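This limiting behavior can be checked numerically. The sketch below (plain NumPy; the design matrix and response are made-up assumptions) brute-forces the constrained problem for a very large $\lambda$ and compares the direction of the solution with the first principal axis $v_1$ and with $X^\top y$. The response is chosen along the second left singular vector, so that $X^\top y$ deliberately points away from $v_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.standard_normal((n, p)) @ np.diag([3.0, 1.0])  # unequal spectrum
U, s, Vt = np.linalg.svd(X, full_matrices=False)
y = U[:, 1]                       # then X^T y is proportional to v2, not v1

def constrained_ridge(lam, n_grid=100_000):
    # Parametrize ||X beta||^2 = 1: with z = S V^T beta a unit vector,
    # beta = V S^{-1} z; scan the unit circle for the minimizer.
    theta = np.linspace(0, 2 * np.pi, n_grid, endpoint=False)
    Z = np.stack([np.cos(theta), np.sin(theta)])
    B = (Vt.T / s) @ Z
    obj = ((y[:, None] - X @ B) ** 2).sum(axis=0) + lam * (B ** 2).sum(axis=0)
    return B[:, obj.argmin()]

beta_inf = constrained_ridge(1e8)
cos = lambda a, b: abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(beta_inf, Vt[0]))       # ~ 1: aligned with the first principal axis
print(cos(beta_inf, X.T @ y))     # ~ 0: not the X^T y (PLS) direction
```

So at least in this adversarial example the large-$\lambda$ solution follows the first principal axis rather than $X^\top y$, consistent with the geometric argument above.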

## Regarding the limit $\lambda \to \infty$

At first (in previous edits) I wrote that there would be some limiting $\lambda_{\lim}$ above which all the solutions are the same (residing at the point $\beta^*_\infty$). But this is not the case.

Consider the optimization as a LARS algorithm or gradient descent. If for any point $\beta$ there is a direction in which we can change $\beta$ such that the penalty term $\|\beta\|^2$ increases less than the SSR term $\|y-X\beta\|^2$ decreases, then you are not at a minimum.

• In normal ridge regression the slope of $\|\beta\|^2$ is zero (in all directions) at the point $\beta=0$. So for any finite $\lambda$ the solution cannot be $\beta = 0$ (since an infinitesimal step can be made that reduces the sum of squared residuals without increasing the penalty).
• For LASSO this is not the case, since the penalty is $\lvert \beta \rvert_1$ (not quadratic with zero slope). Because of that, LASSO has some limiting value $\lambda_{\lim}$ above which all the solutions are zero, because the penalty term (multiplied by $\lambda$) increases more than the residual sum of squares decreases.
• For the constrained ridge you get the same as in regular ridge regression. If you change $\beta$ starting from $\beta^*_\infty$, this change will be perpendicular to $\beta$ (since $\beta^*_\infty$ is perpendicular to the surface of the ellipse $\|X\beta\|^2=1$), and $\beta$ can be changed by an infinitesimal step without changing the penalty term while decreasing the sum of squared residuals. Thus for any finite $\lambda$ the point $\beta^*_\infty$ cannot be the solution.
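The third bullet can be made concrete. At $\beta^*_\infty$ (the first principal axis scaled so that $\|X\beta\|=1$), the tangent direction $v_2$ to the constraint leaves the penalty unchanged to first order, while the RSS still has a nonzero directional derivative, so $\beta^*_\infty$ is not stationary for any finite $\lambda$. A small numerical check (made-up random two-predictor design, assuming $X^\top y$ is not exactly aligned with the principal axis):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

beta_inf = Vt[0] / s[0]          # candidate limit point: ||X beta_inf|| = 1
t = Vt[1]                        # tangent to the constraint at beta_inf

grad_penalty = 2 * beta_inf                       # gradient of ||beta||^2
grad_rss = 2 * (X.T @ (X @ beta_inf) - X.T @ y)   # gradient of ||y - X beta||^2
print(grad_penalty @ t)   # ~ 0 (exactly zero up to float rounding)
print(grad_rss @ t)       # nonzero: RSS can still be decreased along t
```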

## Further notes regarding the limit $\lambda \to \infty$

The usual ridge regression limit for $\lambda \to \infty$ corresponds to a different point in the constrained ridge regression. This 'old' limit corresponds to the point where $\mu$ is equal to $-1$. Then the derivative of the Lagrange function of the normalized problem,

$$2 (1+\mu) X^T X \beta - 2 X^T y + 2 \lambda \beta,$$

corresponds to a solution of the derivative of the Lagrange function of the standard problem,

$$2 X^T X \beta^\prime - 2 X^T y + 2 \frac{\lambda}{1+\mu} \beta^\prime \qquad \text{with } \beta^\prime = (1+\mu)\beta.$$
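This correspondence is a direct substitution of $\beta^\prime = (1+\mu)\beta$, which can be sanity-checked with arbitrary made-up values for $X$, $y$, $\beta$, $\mu$, and $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
mu, lam = 0.7, 2.5

# Gradient of the Lagrange function of the normalized (constrained) problem.
grad_normalized = 2 * (1 + mu) * (X.T @ X @ beta) - 2 * X.T @ y + 2 * lam * beta

# Gradient of the standard ridge problem, evaluated at beta' = (1+mu) beta
# with the rescaled penalty lambda / (1+mu).
beta_p = (1 + mu) * beta
grad_standard = 2 * (X.T @ X @ beta_p) - 2 * X.T @ y + 2 * (lam / (1 + mu)) * beta_p

print(np.allclose(grad_normalized, grad_standard))  # True
```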
