The limit of the “unit-variance” ridge regression estimator when $\lambda\to\infty$

Consider ridge regression with an additional constraint requiring that $\hat{\mathbf y}$ has unit sum of squares (equivalently, unit variance); if needed, one can assume that $\mathbf y$ has unit sum of squares as well:

$$\hat{\boldsymbol\beta}_\lambda^* = \arg\min\Big\{\|\mathbf y - \mathbf X \boldsymbol\beta\|^2+\lambda\|\boldsymbol\beta\|^2\Big\} \quad\text{s.t.}\quad \|\mathbf X \boldsymbol\beta\|^2=1.$$

What is the limit of $\hat{\boldsymbol\beta}_\lambda^*$ when $\lambda\to\infty$?


Here are some statements that I believe are true:

  1. When $\lambda=0$, there is a neat explicit solution: take the OLS estimator $\hat{\boldsymbol\beta}_0=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$ and normalize it to satisfy the constraint (one can see this by adding a Lagrange multiplier and differentiating):
    $$\hat{\boldsymbol\beta}_0^* = \hat{\boldsymbol\beta}_0 \big/ \|\mathbf X\hat{\boldsymbol\beta}_0\|.$$

  2. In general, the solution is $$\hat{\boldsymbol\beta}_\lambda^*=\big((1+\mu)\mathbf X^\top \mathbf X + \lambda \mathbf I\big)^{-1}\mathbf X^\top \mathbf y\quad\text{with $\mu$ needed to satisfy the constraint}.$$ I don't see a closed-form solution when $\lambda>0$. It seems that the solution is equivalent to the usual RR estimator with some $\lambda^*$, normalized to satisfy the constraint, but I don't see a closed formula for $\lambda^*$. (A numerical sketch of this solution is given after this list.)

  3. When $\lambda\to\infty$, the usual RR estimator $$\hat{\boldsymbol\beta}_\lambda=(\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1}\mathbf X^\top \mathbf y$$ obviously converges to zero, but its direction $\hat{\boldsymbol\beta}_\lambda \big/ \|\hat{\boldsymbol\beta}_\lambda\|$ converges to the direction of $\mathbf X^\top \mathbf y$, a.k.a. the first partial least squares (PLS) component.
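
A minimal numerical sketch of statements (1)–(2) (the toy data, seed, and function names are illustrative, not from the original post): it finds $\mu$ by one-dimensional root-finding on the interval where $(1+\mu)\mathbf X^\top\mathbf X+\lambda\mathbf I$ is positive definite, where $\|\mathbf X\boldsymbol\beta(\mu)\|^2$ decreases monotonically in $\mu$, and checks that at $\lambda=0$ the result matches the normalized OLS estimator.

```python
import numpy as np
from scipy.optimize import brentq

# Toy data (illustrative), y scaled to unit sum of squares as in the question
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
y /= np.linalg.norm(y)

XtX, Xty = X.T @ X, X.T @ y
sigma1_sq = np.linalg.eigvalsh(XtX)[-1]          # largest eigenvalue of X'X

def beta_of_mu(mu, lam):
    """Candidate solution ((1+mu) X'X + lam I)^{-1} X'y from statement (2)."""
    return np.linalg.solve((1 + mu) * XtX + lam * np.eye(p), Xty)

def constraint_gap(mu, lam):
    """||X beta(mu)||^2 - 1; we need the mu that makes this zero."""
    return np.linalg.norm(X @ beta_of_mu(mu, lam)) ** 2 - 1

def constrained_ridge(lam):
    """Root-find mu on the interval where (1+mu) X'X + lam I is positive
    definite (mu > -1 - lam/sigma1^2); there ||X beta(mu)||^2 decreases
    monotonically from +inf to 0, so the root is unique."""
    lo = -1 - lam / sigma1_sq + 1e-9 * (1 + lam / sigma1_sq)
    mu = brentq(lambda m: constraint_gap(m, lam), lo, 1e9)
    return beta_of_mu(mu, lam)

# Statement (1): at lambda = 0 the solution is the OLS estimator normalized
# so that ||X beta|| = 1.
beta_ols = np.linalg.solve(XtX, Xty)
beta_ols_star = beta_ols / np.linalg.norm(X @ beta_ols)
print(np.allclose(constrained_ridge(0.0), beta_ols_star, atol=1e-6))
```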

Statements (2) and (3) together make me think that perhaps $\hat{\boldsymbol\beta}_\lambda^*$ also converges to the appropriately normalized $\mathbf X^\top \mathbf y$, but I am not sure if this is correct and I have not managed to convince myself either way.
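
One way to probe this numerically (a rough sketch with illustrative toy data, not part of the original post) is to solve the constrained problem with a generic equality-constrained local solver for increasing $\lambda$ and compare the direction of the solution with the normalized $\mathbf X^\top\mathbf y$ direction and, for reference, with the first principal axis of $\mathbf X$, which comes up in the answer below.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (illustrative), y scaled to unit norm as in the question
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
y /= np.linalg.norm(y)

def solve_constrained(lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 subject to ||X b||^2 = 1.

    Uses a generic local solver (SLSQP) started from the feasible, normalized
    OLS point; crude, but sufficient for this toy example."""
    obj = lambda b: np.sum((y - X @ b) ** 2) + lam * np.sum(b ** 2)
    con = {"type": "eq", "fun": lambda b: np.sum((X @ b) ** 2) - 1.0}
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    b0 /= np.linalg.norm(X @ b0)
    res = minimize(obj, b0, method="SLSQP", constraints=[con],
                   options={"maxiter": 1000, "ftol": 1e-12})
    return res.x

unit = lambda v: v / np.linalg.norm(v)
d_xty = unit(X.T @ y)                  # direction of X'y (first PLS component)
d_pc1 = np.linalg.svd(X)[2][0]         # top right singular vector of X

for lam in [1, 10, 100, 1000, 10000]:
    d = unit(solve_constrained(lam))
    # |cosine| of the solution direction with each candidate limit
    print(lam, abs(d @ d_xty), abs(d @ d_pc1))
```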

# A geometrical interpretation

The estimator described in the question is the Lagrange multiplier equivalent of the following optimization problem:

$$\text{minimize } f(\beta) \text{ subject to } g(\beta) \leq t \text{ and } h(\beta) = 1$$

$$\begin{align} f(\beta) &= \lVert y-X\beta \rVert^2 \\ g(\beta) &= \lVert \beta \rVert^2 \\ h(\beta) &= \lVert X\beta \rVert^2 \end{align}$$

which can be viewed, geometrically, as finding the smallest ellipsoid $f(\beta)=\text{RSS}$ that touches the intersection of the sphere $g(\beta) = t$ and the ellipsoid $h(\beta)=1$.
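
For completeness (this step is not spelled out above), writing the Lagrangian of this program and setting its gradient to zero recovers exactly the form in statement (2) of the question:

$$\mathcal L(\beta,\lambda,\mu) = \lVert y-X\beta\rVert^2 + \lambda\big(\lVert\beta\rVert^2 - t\big) + \mu\big(\lVert X\beta\rVert^2 - 1\big),$$

$$\tfrac12\nabla_\beta \mathcal L = \big((1+\mu)X^\top X + \lambda I\big)\beta - X^\top y = 0 \quad\Longrightarrow\quad \beta = \big((1+\mu)X^\top X + \lambda I\big)^{-1}X^\top y.$$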


## Comparison to the standard ridge regression view

In terms of the geometrical view, this changes the old picture (for standard ridge regression) of the point where a spheroid (the errors) and the sphere $\lVert\beta\rVert^2=t$ touch, into a new picture where we look for the point where the spheroid (errors) touches a curve (the norm of $\beta$ constrained by $\lVert X\beta\rVert^2=1$). The single sphere (blue in the left image) changes into a lower-dimensional figure due to the intersection with the constraint $\lVert X\beta\rVert^2=1$.

In the two-dimensional case this is simple to visualize.

*(Figure: geometric view.)*

When we tune the parameter $t$, we change the relative sizes of the blue/red spheres, i.e. the relative sizes of $f(\beta)$ and $g(\beta)$. (In the theory of Lagrange multipliers there is probably a neat way to show formally and exactly that $t$ as a function of $\lambda$, or the reverse, is a monotonic function. But I imagine that you can see intuitively that the sum of squared residuals can only increase when we decrease $\lVert\beta\rVert$.)

The solution $\beta_\lambda$ for $\lambda=0$ is, as you argued, on the line through $0$ and $\beta_{LS}$.

The solution $\beta_\lambda$ for $\lambda \to \infty$ is (indeed, as you commented) given by the loadings of the first principal component. This is the point where $\lVert \beta \rVert^2$ is smallest subject to $\lVert X\beta \rVert^2 = 1$; it is the point where the circle $\lVert \beta \rVert^2=t$ touches the ellipse $\lVert X\beta \rVert^2=1$ in a single point.
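
A short check of this closed form (toy design matrix, illustrative only, not from the original answer): the minimum-norm point on $\lVert X\beta\rVert^2=1$ should be $v_1/\sigma_1$, with $\sigma_1$, $v_1$ the top singular value and right singular vector of $X$, and every other feasible point should be at least as long.

```python
import numpy as np

# Toy design matrix (illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))

# Claimed minimizer of ||b||^2 on {||X b||^2 = 1}: v1 / sigma1, where sigma1
# and v1 are the top singular value and right singular vector of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_inf = Vt[0] / s[0]
print(np.isclose(np.linalg.norm(X @ beta_inf), 1.0))      # feasible
print(np.isclose(np.linalg.norm(beta_inf), 1 / s[0]))     # has norm 1/sigma1

# Any other feasible point b = u / ||X u|| (u a random direction) is at least
# as long, because ||X u|| <= sigma1 ||u||.
D = rng.standard_normal((5, 10000))
feasible = D / np.linalg.norm(X @ D, axis=0)
print(np.all(np.linalg.norm(feasible, axis=0) >= 1 / s[0] - 1e-12))
```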

In this 2-d view the edges of the intersection of the sphere $\lVert \beta \rVert^2 = t$ and the spheroid $\lVert X\beta \rVert^2 = 1$ are points. In multiple dimensions these will be curves.

(I imagined at first that these curves would be ellipses, but they are more complicated. You could imagine the ellipsoid $\lVert X \beta \rVert^2 = 1$ being intersected by the ball $\lVert \beta \rVert^2 \leq t$ as some sort of ellipsoid frustum, but with edges that are not simple ellipses.)


## Regarding the limit $\lambda \to \infty$

At first (in previous edits) I wrote that there would be some limiting $\lambda_{\text{lim}}$ above which all the solutions are the same (namely the point $\beta^*_\infty$). But this is not the case.

Consider the optimization as a LARS algorithm or gradient descent. If at any point $\beta$ there is a direction in which we can change $\beta$ such that the penalty term $\lVert\beta\rVert^2$ increases less than the SSR term $\lVert y-X\beta\rVert^2$ decreases, then you are not at a minimum.

  • In normal ridge regression, $\lVert\beta\rVert^2$ has zero slope (in all directions) at the point $\beta=0$. So for any finite $\lambda$ the solution cannot be $\beta = 0$ (an infinitesimal step can be made that reduces the sum of squared residuals without increasing the penalty).
  • For LASSO this is not the case, since the penalty is $\lVert \beta \rVert_1$ (it is not quadratic with zero slope). Because of that, LASSO has some limiting value $\lambda_{\text{lim}}$ above which all the solutions are zero, because the penalty term (multiplied by $\lambda$) increases more than the residual sum of squares decreases.
  • For the constrained ridge you get the same behaviour as in regular ridge regression. If you change $\beta$ starting from $\beta^*_\infty$, then the change will be perpendicular to $\beta$ ($\beta^*_\infty$ is perpendicular to the surface of the ellipse $\lVert X\beta\rVert^2=1$), so $\beta$ can be changed by an infinitesimal step without changing the penalty term while decreasing the sum of squared residuals. Thus for any finite $\lambda$ the point $\beta^*_\infty$ cannot be the solution. (A small numerical check of this is given after this list.)
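
A minimal numerical illustration of the last bullet (toy data, illustrative only): at $\beta^*_\infty$ the gradient of the penalty has no component tangent to the constraint ellipse, while the gradient of the sum of squared residuals generically does, so a first-order move along the constraint changes the SSR but not the penalty.

```python
import numpy as np

# Toy data (illustrative), y with unit norm as in the question
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
y /= np.linalg.norm(y)

# beta*_inf: the minimum-norm point on {||X b||^2 = 1} (see the SVD check above)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_inf = Vt[0] / s[0]

# The normal to the ellipse {||X b||^2 = 1} at beta_inf is X'X beta_inf, which
# is proportional to beta_inf itself; P_tan projects onto the tangent plane.
normal = X.T @ (X @ beta_inf)
P_tan = np.eye(5) - np.outer(normal, normal) / (normal @ normal)

grad_ssr = 2 * X.T @ (X @ beta_inf - y)   # gradient of ||y - X b||^2 at beta_inf
grad_pen = 2 * beta_inf                   # gradient of ||b||^2 at beta_inf

print(np.linalg.norm(P_tan @ grad_pen))   # ~ 0: penalty flat along the ellipse
print(np.linalg.norm(P_tan @ grad_ssr))   # > 0: SSR can still be decreased
```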

## Further notes regarding the limit $\lambda \to \infty$

The usual ridge regression limit for $\lambda \to \infty$ corresponds to a different point in the constrained ridge regression. This 'old' limit corresponds to the point where $\mu$ is equal to $-1$. The derivative of the Lagrangian of the normalized problem,

$$2 (1+\mu) X^{\top}X \beta - 2 X^\top y + 2 \lambda \beta,$$

corresponds to the derivative of the Lagrangian of the standard problem,

$$2 X^{\top}X \beta^\prime - 2 X^\top y + 2 \frac{\lambda}{1+\mu} \beta^\prime \qquad \text{with } \beta^\prime = (1+\mu)\beta,$$

so the constrained solution is a rescaled standard ridge solution with effective penalty $\lambda/(1+\mu)$, and $\mu \to -1$ corresponds to that effective penalty going to infinity.
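
A quick algebraic check of this correspondence (the values of $\mu$ and $\lambda$ below are arbitrary illustrative numbers, not the ones solving the constrained problem): rescaling the candidate solution of the normalized problem by $(1+\mu)$ gives the standard ridge estimator with penalty $\lambda/(1+\mu)$.

```python
import numpy as np

# Toy data; mu and lam below are arbitrary illustrative values
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
XtX, Xty = X.T @ X, X.T @ y

lam, mu = 3.0, 0.7
beta = np.linalg.solve((1 + mu) * XtX + lam * np.eye(5), Xty)        # normalized-problem form
beta_prime = np.linalg.solve(XtX + lam / (1 + mu) * np.eye(5), Xty)  # standard RR, penalty lam/(1+mu)
print(np.allclose((1 + mu) * beta, beta_prime))                      # beta' = (1 + mu) beta
```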
