In the orthogonal design case of the lasso, we get $\hat{\beta}_j^{\text{lasso}} = 0$ if $|\hat{\beta}_j| \le \lambda/2$. Why?
I've seen the answer and derived it myself, but I don't see why the conclusion follows.
We begin with the definition of the lasso,
$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p}\beta_{j}x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$
In the orthogonal design case where $X^T X = I$, the OLS estimate is $\hat{\beta} = (X^TX)^{-1}X^{T}y = X^Ty$.
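For concreteness, here is a small numerical sketch (my own illustration with made-up data, not part of the question) confirming that under an orthonormal design the OLS estimate collapses to $X^Ty$:

```python
# Illustrative sketch (hypothetical data): build an orthonormal design X,
# so that X'X = I, and check that the OLS estimate equals X'y.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X, _ = np.linalg.qr(rng.normal(size=(n, p)))       # columns of X are orthonormal
beta_true = np.array([2.0, -1.0, 0.3])             # made-up coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)       # generic OLS formula
print(np.allclose(X.T @ X, np.eye(p)))             # True: orthonormal design
print(np.allclose(beta_ols, X.T @ y))              # True: OLS reduces to X'y
```

With that in place, expand the objective: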
\begin{align} L(\beta, \lambda) & = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p}\beta_{j}x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\
& = (Y - X \beta)^T(Y - X \beta) + \lambda \sum_{j=1}^{p} |\beta_j| \\
& = Y^TY - 2\hat{\beta}^T\beta + \beta^T \beta + \lambda \sum_{j=1}^{p} |\beta_j| \\
& = Y^TY + \sum_{j=1}^{p} L_j(\beta_j, \lambda)
\end{align}
where $L_j(\beta_j, \lambda) = -2 \hat{\beta}_j \beta_j + \beta^2_j + \lambda |\beta_j|$.
Leaving aside $\beta_j = 0$ for the moment, take the derivative w.r.t. $\beta_j$ for $|\beta_j| > 0$:
$$\frac{\partial L_j(\beta_j, \lambda)}{\partial \beta_j} = -2 \hat{\beta}_j + 2\beta_j + \lambda \operatorname{sign}(\beta_j),$$
so $\hat{\beta}^{\text{lasso}}_j$ either is zero or solves
$$\beta_j + \lambda \operatorname{sign}(\beta_j) / 2 = \hat{\beta}_j,$$
which is,
$$
\hat{\beta}^{\text{lasso}}_j =
\begin{cases}
\hat{\beta}_j - \lambda/2, & \text{if } \hat{\beta}_j > \lambda/2\\
\hat{\beta}_j + \lambda/2, & \text{if } \hat{\beta}_j < -\lambda/2.
\end{cases}
$$
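As a sanity check on the case formula above (just an illustrative snippet; `soft_threshold` and the test values are made up by me), the closed-form coordinate-wise solution can be compared against a brute-force grid minimization of $L_j$:

```python
# Hedged numerical check: compare the closed-form soft-thresholding rule
# beta_j = sign(bhat_j) * max(|bhat_j| - lam/2, 0) with a brute-force grid
# minimization of L_j(beta) = -2*bhat*beta + beta**2 + lam*|beta|.
import numpy as np

def soft_threshold(bhat, lam):
    # coordinate-wise lasso solution under an orthonormal design
    return np.sign(bhat) * np.maximum(np.abs(bhat) - lam / 2, 0.0)

lam = 1.0
grid = np.linspace(-5, 5, 200001)
for bhat in [2.0, 0.4, -0.3, -1.7]:                # middle two satisfy |bhat| <= lam/2
    L = -2 * bhat * grid + grid**2 + lam * np.abs(grid)
    brute = grid[np.argmin(L)]
    print(bhat, soft_threshold(bhat, lam), np.isclose(brute, soft_threshold(bhat, lam), atol=1e-4))
```

In particular, the two test values with $|\hat{\beta}_j| \le \lambda/2$ come out as exactly zero, which is the case my question is about.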
My question is about the following step in the derivation.
If $|\hat{\beta}_j| \le \lambda / 2$, we get
$$L_j(\beta_j, \lambda)
= -2 \hat{\beta}_j \beta_j + \beta^2_j + \lambda |\beta_j|
\ge -\lambda |\beta_j| + \beta^2_j + \lambda |\beta_j|
= \beta^2_j \ge 0 = L_j(0, \lambda),$$
using $-2\hat{\beta}_j\beta_j \ge -2|\hat{\beta}_j|\,|\beta_j| \ge -\lambda |\beta_j|$,
and from this we conclude that $\hat{\beta}_j^{\text{lasso}} = 0$ if $|\hat{\beta}_j| \le \lambda/2$. (Why? How can you tell?)
Why is $\hat{\beta}_j^{\text{lasso}} = 0$? The statement $L_j(\beta_j, \lambda) \ge L_j(0, \lambda)$ does not seem to justify it.
Best Answer
Your derivation is not quite precise: you are not actually taking the derivative but the subderivative, since the function $|x|$ is not differentiable at $x = 0$. The subderivative $s$ of the absolute value at $x = 0$ is any $s \in [-1, 1]$.
Thus, the conditions you derived are for the case $\hat{\beta}^{\text{lasso}}_j \neq 0$, where the subdifferential of the absolute value is indeed the sign. Now consider the case $\hat{\beta}^{\text{lasso}}_j = 0$. By the KKT conditions, this happens when $-\hat{\beta}_j^{\text{ols}} + s\frac{\lambda}{2} = 0$ for some $s \in [-1, 1]$, which implies $|\hat{\beta}_j^{\text{ols}}| \leq \frac{\lambda}{2}$.
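To make the zero case concrete, here is a tiny check (my own sketch, not something from the original derivation): $\beta_j = 0$ can satisfy the stationarity condition only if the required subgradient value lies in $[-1, 1]$, which happens exactly when $|\hat{\beta}_j^{\text{ols}}| \le \lambda/2$.

```python
# Illustrative check: for beta_j = 0 the stationarity condition
# -2*bhat_ols + lam*s = 0 forces s = 2*bhat_ols/lam, which must lie in [-1, 1].
lam = 1.0
for bhat_ols in [0.1, 0.5, 0.8]:                    # hypothetical OLS coefficients
    s_needed = 2 * bhat_ols / lam
    print(bhat_ols, s_needed, abs(s_needed) <= 1)   # True, True, False
```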
The LASSO problem
For the sake of completeness I will write down the lasso problem here. Our goal is to minimize
$$\min_{\beta} \| Y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
where $\|\cdot\|_1$ is the $\ell_1$ norm. This is a convex optimization problem, and the optimum is characterized by the KKT conditions:
$$ -2X'(Y - X\beta) + \lambda s = 0 $$
where $s$ is the subgradient of the $\ell_1$ norm, that is, $s_j = \operatorname{sign}(\beta_j)$ if $\beta_j \neq 0$ and $s_j \in [-1, 1]$ if $\beta_j = 0$.
In the orthonormal case, $X'Y = \hat{\beta}^{OLS}$ and $X'X = I$, simplifying this to:
$$ -2\hat{\beta}^{OLS} + 2\beta + \lambda s = 0. $$
Thus, consider the case where the solution has $\beta_j = 0$. For this to be true we must have $-2\hat{\beta}_j^{OLS} + \lambda s_j = 0$, which implies $|\hat{\beta}_j^{OLS}| \leq \frac{\lambda}{2}$, since $s_j \in [-1, 1]$. Since this is a convex program, the KKT conditions are sufficient, and the condition works both ways, that is, $|\hat{\beta}_j^{OLS}| \leq \frac{\lambda}{2} \implies \beta_j = 0$.
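For completeness, here is a small end-to-end verification (an illustrative script with simulated data, not part of the argument itself) that the soft-thresholded OLS estimate satisfies these KKT conditions under an orthonormal design:

```python
# Hedged end-to-end check of the KKT conditions in the orthonormal case,
# using simulated data and the soft-thresholding solution derived above.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X, _ = np.linalg.qr(rng.normal(size=(n, p)))                # X'X = I
y = X @ np.array([3.0, -0.1, 0.0, 2.0, -2.5]) + 0.1 * rng.normal(size=n)

lam = 1.5
beta_ols = X.T @ y                                          # OLS under orthonormal design
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

# KKT: -2 X'(y - X beta) + lam * s = 0 for some valid subgradient s
grad = -2 * X.T @ (y - X @ beta_lasso)
active = beta_lasso != 0
print(np.allclose(grad[active], -lam * np.sign(beta_lasso[active])))   # s_j = sign(beta_j) on active set
print(np.all(np.abs(grad[~active]) <= lam + 1e-12))                    # some s_j in [-1, 1] works on the zeros
```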