Solved – How does lasso regularization select the “less important” features

I'm just starting out in machine learning, and I can't figure out how the lasso method determines which features are redundant so that it can shrink their coefficients to zero.

There are many ways to think about regularization. I find the restricted optimization formulation to be quite intuitive.

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|$$ $$\text{subject to } \|\boldsymbol{\beta}\| \leq \lambda^*$$

Usually in the first line we use the squared L2 norm, which corresponds to the Ordinary Least Squares objective. The constraint in the second line gives a way of "shrinking" the estimates back toward the origin.

If an L2 norm is used in the second line, we have Ridge Regression, which effectively pulls the OLS estimates back toward the origin. This is useful in many situations, but it is not very good at setting estimates exactly to 0.

If an L1 norm is used in the second line, we get the LASSO. This is good for "feature selection," since it can set the coefficients of inert or nearly inert features exactly to 0.
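
As a quick sanity check (not from the original answer), here is a minimal scikit-learn sketch that shows this behavior; the data sizes and `alpha` values are illustrative assumptions, not tuned choices:

```python
# Minimal sketch: fit ridge and lasso on synthetic data where only a few
# features matter, then compare the fitted coefficients.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))

# The true model uses only the first 3 features; the other 7 are inert.
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_beta + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks all coefficients but typically keeps them nonzero;
# lasso typically sets the inert features' coefficients exactly to 0.
print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
print("lasso zeros:", np.sum(lasso.coef_ == 0))
```

Running this, you should see the ridge coefficients for the inert features hovering near (but not at) zero, while the lasso coefficients for those same features are exactly zero.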

The following image illustrates this nicely in two dimensions.

[Figure: elliptical error contours touching the diamond-shaped L1 constraint region (lasso) versus the circular L2 constraint region (ridge)]

Figure from Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
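
To see mechanically why the L1 ball produces exact zeros, it helps to look at the special case of an orthonormal design ($\mathbf{X}^\top\mathbf{X} = \mathbf{I}$). The closed form below is a standard textbook result (not part of the original answer), where $\lambda$ is the penalty parameter of the equivalent Lagrangian form of the problem above:

$$\hat{\beta}_j^{\text{lasso}} = \operatorname{sign}\!\left(\hat{\beta}_j^{\text{OLS}}\right) \left( \left|\hat{\beta}_j^{\text{OLS}}\right| - \lambda \right)_+$$

Each OLS coefficient is pulled toward zero by $\lambda$ and truncated there, so any feature whose OLS coefficient is smaller in magnitude than $\lambda$ gets exactly 0. Ridge in the same setting gives $\hat{\beta}_j^{\text{ridge}} = \hat{\beta}_j^{\text{OLS}} / (1 + \lambda)$: everything shrinks proportionally, and nothing is ever exactly 0.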
