Consider ridge regression with an additional constraint requiring that $\hat{\mathbf y}$ has unit sum of squares (equivalently, unit variance); if needed, one can assume that $\mathbf y$ has unit sum of squares as well:

$$\hat{\boldsymbol\beta}_\lambda^* = \arg\min\Big\{\|\mathbf y - \mathbf X \boldsymbol\beta\|^2+\lambda\|\boldsymbol\beta\|^2\Big\} \quad\text{s.t.}\quad \|\mathbf X \boldsymbol\beta\|^2=1.$$

What is the limit of $\hat{\boldsymbol\beta}_\lambda^*$ when $\lambda\to\infty$?

**Here are some statements that I believe are true:**

1. When $\lambda=0$, there is a neat explicit solution: take the OLS estimator $\hat{\boldsymbol\beta}_0=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$ and normalize it to satisfy the constraint (one can see this by adding a Lagrange multiplier and differentiating): $$\hat{\boldsymbol\beta}_0^* = \hat{\boldsymbol\beta}_0 \big/ \|\mathbf X\hat{\boldsymbol\beta}_0\|.$$

2. In general, the solution is $$\hat{\boldsymbol\beta}_\lambda^*=\big((1+\mu)\mathbf X^\top \mathbf X + \lambda \mathbf I\big)^{-1}\mathbf X^\top \mathbf y \quad\text{with $\mu$ chosen to satisfy the constraint}.$$ I don't see a closed-form solution when $\lambda>0$. It seems that the solution is equivalent to the usual RR estimator with *some* $\lambda^*$, normalized to satisfy the constraint, but I don't see a closed formula for $\lambda^*$.

3. When $\lambda\to\infty$, the usual RR estimator $$\hat{\boldsymbol\beta}_\lambda=(\mathbf X^\top \mathbf X + \lambda \mathbf I)^{-1}\mathbf X^\top \mathbf y$$ obviously converges to zero, but its *direction* $\hat{\boldsymbol\beta}_\lambda \big/ \|\hat{\boldsymbol\beta}_\lambda\|$ converges to the direction of $\mathbf X^\top \mathbf y$, a.k.a. the first partial least squares (PLS) component.

Statements (2) and (3) together make me think that perhaps $hat{boldsymbolbeta}_lambda^*$ also converges to the appropriately normalized $mathbf X^top mathbf y$, but I am not sure if this is correct and I have not managed to convince myself either way.
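
Statement (3) is easy to check numerically. The following sketch (synthetic data; the variable names are my own, not from the post) computes the cosine between the ridge estimator and $\mathbf X^\top \mathbf y$ for growing $\lambda$:

```python
# Numerical sanity check of statement (3): as lambda grows, the direction of
# the ridge estimator approaches the direction of X^T y.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def ridge(lam):
    # usual ridge estimator (X'X + lam I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = X.T @ y
for lam in [1e0, 1e3, 1e6]:
    print(lam, cosine(ridge(lam), target))
# the cosine tends to 1 as lambda -> infinity
```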

#### Best Answer

# A geometrical interpretation

The estimator described in the question is the Lagrange multiplier equivalent of the following optimization problem:

$$\text{minimize } f(\beta) \text{ subject to } g(\beta) \leq t \text{ and } h(\beta) = 1$$

$$\begin{align} f(\beta) &= \lVert y-X\beta \rVert^2 \\ g(\beta) &= \lVert \beta \rVert^2 \\ h(\beta) &= \lVert X\beta \rVert^2 \end{align}$$

which can be viewed, geometrically, as finding the smallest ellipsoid $f(\beta)=\text{RSS}$ that touches the intersection of the sphere $g(\beta) = t$ and the ellipsoid $h(\beta)=1$.

## Comparison to the standard ridge regression view

In terms of the geometrical view, this changes the *old* picture (for standard ridge regression), in which **a spheroid (the error contours) and a sphere ($\|\beta\|^2=t$) touch**, into a new one, in which we look for the point where **the spheroid (the error contours) touches a curve (the norm of $\beta$ constrained by $\|X\beta\|^2=1$)**. The single sphere (blue in the left image) turns into a lower-dimensional figure due to the intersection with the $\|X\beta\|^2=1$ constraint.

In the two-dimensional case this is simple to visualize.

When we tune the parameter $t$, we change the relative sizes of the blue/red spheres, i.e. the relative sizes of $f(\beta)$ and $g(\beta)$. *(In the theory of Lagrange multipliers there is probably a neat way to show formally that $t$ as a function of $\lambda$, or the reverse, is a monotonic function. But I imagine you can see intuitively that the sum of squared residuals can only increase when we decrease $\|\beta\|$.)*
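
The claimed monotonic relation between $\lambda$ and $t$ can be illustrated numerically: along the ridge path, the implied constraint radius $t = \|\beta_\lambda\|^2$ shrinks as $\lambda$ grows. A small sketch with synthetic data (not from the post):

```python
# Check that ||beta_lambda|| decreases monotonically along the ridge path,
# so lambda and the constraint radius t correspond one-to-one.
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

lambdas = np.logspace(-3, 6, 30)
norms = [np.linalg.norm(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))
         for lam in lambdas]
print(np.all(np.diff(norms) < 0))  # True: the norm shrinks as lambda grows
```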

The solution $\beta_\lambda$ for $\lambda=0$ is, as you argued, on the line between $0$ and $\beta_\text{LS}$.

The solution $\beta_\lambda$ for $\lambda \to \infty$ is (indeed, as you commented) given by the loadings of the first principal component. This is the point where $\lVert \beta \rVert^2$ is smallest subject to $\lVert X\beta \rVert^2 = 1$. It is the point where the circle $\lVert \beta \rVert^2=t$ touches the ellipse $\lVert X\beta \rVert^2=1$ in a single point.
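
This limit can be checked numerically. The sketch below (synthetic data, my own variable names) solves the constrained problem via the stationarity condition $\beta = \big((1+\mu)X^\top X + \lambda I\big)^{-1}X^\top y$, choosing $\mu$ by bisection so that $\|X\beta\|^2=1$ on the branch where $(1+\mu)X^\top X + \lambda I \succeq 0$ (which contains the global minimizer), and compares the direction for a large $\lambda$ with the first principal axis:

```python
# Solve min ||y - X b||^2 + lam ||b||^2  s.t.  ||X b||^2 = 1 by root-finding
# in mu, then check that for large lam the solution aligns with the first
# principal axis of X (top eigenvector of X'X).
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

d, V = np.linalg.eigh(X.T @ X)        # eigenvalues in ascending order
c = V.T @ (X.T @ y)

def beta_star(lam):
    # ||X b(s)||^2 as a function of s = 1 + mu, in the eigenbasis of X'X;
    # it decreases from +inf to 0 on (-lam/d_max, inf), so bisection works
    def xb_sq(s):
        return np.sum(d * c**2 / (s * d + lam)**2)
    lo = -lam / d[-1] * (1 - 1e-12)   # just above the pole: xb_sq(lo) >> 1
    hi = 0.0
    while xb_sq(hi) > 1:              # widen until xb_sq(hi) < 1
        hi = 2 * hi + 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if xb_sq(mid) > 1 else (lo, mid)
    s = 0.5 * (lo + hi)
    return V @ (c / (s * d + lam))

b = beta_star(1e6)
v1 = V[:, -1]                          # first principal axis
print(abs(b @ v1) / np.linalg.norm(b)) # close to 1: b aligns with v1
```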

In this 2-d view the edges of the intersection of the sphere $\lVert \beta \rVert^2 = t$ and the spheroid $\lVert X\beta \rVert^2 = 1$ are points. In more dimensions these will be curves.

*(I first imagined that these curves would be ellipses, but they are more complicated. You could imagine the ellipsoid $\lVert X\beta \rVert^2 = 1$ being intersected by the ball $\lVert \beta \rVert^2 \leq t$ as some sort of ellipsoid frustum, but with edges that are not simple ellipses.)*

## Regarding the limit $\lambda \to \infty$

*In earlier edits I wrote that there would be some limiting $\lambda_\text{lim}$ above which all the solutions are the same (residing at the point $\beta^*_\infty$). But this is not the case.*

Consider the optimization as in a LARS algorithm or gradient descent: if at any point $\beta$ there is a direction in which we can change $\beta$ such that the penalty term $\lambda\|\beta\|^2$ increases less than the SSR term $\|y-X\beta\|^2$ decreases, then you are not at a minimum.

- In **normal ridge regression**, the penalty $\|\beta\|^2$ has zero slope (in all directions) at the point $\beta=0$. So for every finite $\lambda$ the solution cannot be $\beta = 0$: an infinitesimal step away from $0$ reduces the sum of squared residuals without increasing the penalty to first order.
- **For LASSO** this is *not* the case, since the penalty is $\lvert \beta \rvert_1$ (not quadratic with zero slope). Because of that, LASSO does have a limiting value $\lambda_\text{lim}$ above which all the solutions are zero, because beyond it the penalty term (multiplied by $\lambda$) increases more than the residual sum of squares decreases.
- **For the constrained ridge** you get the same behaviour as in regular ridge regression. If you perturb $\beta$ starting from $\beta^*_\infty$, the change is *perpendicular* to $\beta$ (the vector $\beta^*_\infty$ is perpendicular to the surface of the ellipse $\|X\beta\|^2=1$), so $\beta$ can be moved by an infinitesimal step without changing the penalty term while decreasing the sum of squared residuals. Thus for any finite $\lambda$ the point $\beta^*_\infty$ cannot be the solution.
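
The last bullet can be illustrated numerically: at the principal-axis point $\beta^*_\infty$ on $\|X\beta\|^2=1$, a step tangent to the constraint leaves the penalty unchanged to first order, yet generically changes the residual sum of squares. A sketch with synthetic data (not the original poster's code):

```python
# At beta_inf (first principal axis scaled so ||X beta_inf||^2 = 1), take a
# direction tangent to the constraint ellipse and compare the directional
# derivatives of the penalty ||b||^2 and of the SSR ||y - X b||^2.
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

d, V = np.linalg.eigh(X.T @ X)            # ascending eigenvalues
beta_inf = V[:, -1] / np.sqrt(d[-1])      # ||X beta_inf||^2 = 1

t = V[:, 0]                               # tangent to the ellipse at beta_inf
pen_deriv = 2 * beta_inf @ t              # derivative of ||b||^2: ~0
ssr_deriv = -2 * (y - X @ beta_inf) @ (X @ t)  # derivative of the SSR
print(pen_deriv, ssr_deriv)               # ~0 and a generically nonzero value
```

Since the SSR can be decreased at no first-order cost in the penalty, $\beta^*_\infty$ is not the constrained-ridge solution for any finite $\lambda$.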

## Further notes regarding the limit $\lambda \to \infty$

The usual ridge regression limit for $\lambda \to \infty$ corresponds to a different point in the constrained ridge regression. This 'old' limit corresponds to the point where $\mu$ is equal to $-1$. Then the derivative of the Lagrangian of the normalized problem,

$$2 (1+\mu) X^T X \beta - 2 X^T y + 2 \lambda \beta,$$

corresponds to the derivative of the Lagrangian of the standard problem evaluated at $\beta^\prime = (1+\mu)\beta$,

$$2 X^T X \beta^\prime - 2 X^T y + 2 \frac{\lambda}{1+\mu} \beta^\prime.$$
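
The substitution is a pure algebraic identity and can be verified with arbitrary numbers (synthetic values; a sketch only):

```python
# Check that for any beta and mu != -1 the two gradient expressions agree:
# 2(1+mu) X'X b - 2 X'y + 2 lam b  ==  2 X'X b' - 2 X'y + 2 (lam/(1+mu)) b'
# with b' = (1+mu) b.
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
b = rng.standard_normal(p)
lam, mu = 3.0, 0.7

g1 = 2 * (1 + mu) * X.T @ X @ b - 2 * X.T @ y + 2 * lam * b
bp = (1 + mu) * b
g2 = 2 * X.T @ X @ bp - 2 * X.T @ y + 2 * (lam / (1 + mu)) * bp
print(np.allclose(g1, g2))  # True
```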