I'm just starting out in machine learning, and I can't figure out how the lasso method finds which features are redundant and shrinks their coefficients to zero.


#### Best Answer

There are many ways to think about regularization. I find the restricted optimization formulation to be quite intuitive.

$$\hat{\boldsymbol\beta} = \operatorname*{argmin}_{\boldsymbol\beta} \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|$$ $$\text{subject to } \|\boldsymbol\beta\| \leq \lambda^*$$

Usually in the first line we use the squared L2 norm, which corresponds to the *Ordinary Least Squares* objective. The restriction gives a way of "shrinking" the estimates back toward the origin.

If an L2 norm is used in the second line, we have Ridge Regression, which effectively pulls the OLS estimates back towards the origin. Useful in many situations, but not very good at setting estimates equal to 0.
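To see the shrinkage concretely, here is a minimal sketch using scikit-learn's `Ridge` on synthetic data (the data-generating process and `alpha=10.0` are my own illustrative choices, not from the answer): the ridge coefficients are pulled toward zero relative to OLS, but none of them land exactly at zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the remaining three are inert.
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge shrinks every coefficient toward 0 but leaves all of them nonzero.
print("OLS:  ", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
```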

If an L1 norm is used in the second line, we get LASSO. This is good for "feature selection" since it is able to effectively set inert or nearly-inert features to 0.
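Running LASSO on the same kind of toy problem shows the feature-selection effect. This is a sketch with scikit-learn's `Lasso`; the data and the penalty strength `alpha=0.5` are illustrative assumptions, chosen large enough to zero out the inert features.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the remaining three are inert.
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)

# The L1 penalty sets the coefficients of the inert features exactly to 0,
# while the informative ones survive (shrunk toward the origin).
print("Lasso:", np.round(lasso.coef_, 3))
```

Note that, unlike ridge, the inert coefficients come out as exact zeros, not merely small numbers.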

The following image illustrates this nicely in two dimensions.

Figure from *The Elements of Statistical Learning* by Hastie, Tibshirani, and Friedman.
