Zou et al., "On the 'degrees of freedom' of the lasso" (2007), show that the number of nonzero coefficients is an unbiased and consistent estimator of the degrees of freedom of the lasso.

This seems a little counterintuitive to me.

- Suppose we have a regression model (where the variables are zero mean)

$$y = \beta x + \varepsilon.$$

- Suppose an unrestricted OLS estimate of $\beta$ is $\hat\beta_{OLS}=0.5$. It could roughly coincide with a LASSO estimate of $\beta$ for a very low penalty intensity.
- Suppose further that a LASSO estimate for a particular penalty intensity $\lambda^*$ is $\hat\beta_{LASSO,\lambda^*}=0.4$. For example, $\lambda^*$ could be the "optimal" $\lambda$ for the data set at hand found using cross validation.
- If I understand correctly, the degrees of freedom are 1 in both cases, as there is one nonzero regression coefficient each time.
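The setup above can be sketched numerically. This is a simulation with illustrative choices (true $\beta = 0.5$, $\lambda^* = 0.1$, noise scale 0.5, none of which come from the question beyond $\beta$); for a single zero-mean predictor, the LASSO solution is the soft-thresholded OLS solution:

```python
import numpy as np

# Minimal single-predictor sketch (simulated data; lambda = 0.1 and the
# noise scale are illustrative assumptions). Both OLS and the LASSO keep
# the coefficient nonzero, so the nonzero-count df estimate is 1 for both.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)  # true beta = 0.5

beta_ols = x @ y / (x @ x)

# LASSO for the objective (1/(2n))*||y - b*x||^2 + lam*|b|:
# soft-threshold the scaled correlation, then rescale.
lam = 0.1
z = x @ y / n
beta_lasso = np.sign(z) * max(abs(z) - lam, 0.0) / (x @ x / n)

df_ols = int(beta_ols != 0)      # 1
df_lasso = int(beta_lasso != 0)  # 1: shrunk toward 0.4, but still nonzero
print(beta_ols, beta_lasso, df_ols, df_lasso)
```

Here $\hat\beta_{LASSO}$ comes out near 0.4 while $\hat\beta_{OLS}$ is near 0.5, yet both df estimates equal 1.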

**Question:**

- How come the degrees of freedom in both cases are the same even though $\hat\beta_{LASSO,\lambda^*}=0.4$ suggests less "freedom" in fitting than $\hat\beta_{OLS}=0.5$?

**References:**

- Zou, Hui, Trevor Hastie, and Robert Tibshirani. "On the “degrees of freedom” of the lasso." *The Annals of Statistics* 35.5 (2007): 2173–2192.


#### Best Answer

Assume we are given a set of $n$ $p$-dimensional observations, $x_i \in \mathbb{R}^p$, $i = 1, \dotsc, n$. Assume a model of the form
$$Y_i = \langle \beta, x_i\rangle + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2)$, $\beta \in \mathbb{R}^p$, and $\langle \cdot, \cdot \rangle$ denotes the inner product. Let $\hat{\beta} = \delta(\{Y_i\}_{i=1}^n)$ be an estimate of $\beta$ obtained by a fitting method $\delta$ (either OLS or LASSO for our purposes). The formula for degrees of freedom given in the article (equation 1.2) is
$$\text{df}(\hat{\beta}) = \sum_{i=1}^n \frac{\text{Cov}(\langle\hat{\beta}, x_i\rangle, Y_i)}{\sigma^2}.$$
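This covariance formula can be checked by Monte Carlo for the OLS case, where the true df equals $p$ (the trace of the hat matrix). The sizes, seed, and coefficient values below are illustrative assumptions, not from the paper:

```python
import numpy as np

# Monte Carlo check of the covariance df formula for OLS: with X fixed
# and Y resampled from the model, the summed covariances between fitted
# values and responses, divided by sigma^2, should approach p (here 3).
rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix; trace(H) = p

reps = 20_000
eps = rng.normal(scale=sigma, size=(reps, n))
ys = (X @ beta) + eps   # each row is one simulated dataset
fits = ys @ H.T         # fitted values for each dataset (H is symmetric)

# Sample covariance between fitted value and response at each i.
cov_i = ((fits - fits.mean(0)) * (ys - ys.mean(0))).mean(0)
df_est = cov_i.sum() / sigma**2
print(round(df_est, 2))  # close to p = 3
```

For OLS the sum collapses to $\operatorname{tr}(H) = p$ exactly; the simulation just recovers this up to Monte Carlo error.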

By inspecting this formula we can surmise that, in accordance with your intuition, the *true* DOF for the LASSO will indeed be less than the *true* DOF of OLS; the coefficient-shrinkage effected by the LASSO should tend to decrease the covariances.

Now, to answer your question: the reason the DOF for the LASSO is the same as the DOF for OLS in your example is simply that you are dealing with *estimates* (albeit unbiased ones) of the true DOF values, obtained from a particular dataset sampled from the model. For any particular dataset, such an estimate will generally not equal the true value (especially since the estimate must be an integer while the true value is, in general, a real number).

However, when such estimates are averaged over many datasets sampled from the model, by unbiasedness and the law of large numbers such an average will converge to the true DOF. In the case of the LASSO, some of those datasets will result in an estimator wherein the coefficient is actually 0 (though such datasets might be rare if $lambda$ is small). In the case of OLS, the estimate of the DOF is always the number of coefficients, *not* the number of non-zero coefficients, and so the average for the OLS case will not contain these zeros. This shows how the estimators differ, and how the average estimator for the LASSO DOF can converge to something smaller than the average estimator for the OLS DOF.
