I am running a regression model with both Lasso and Ridge (to predict a discrete outcome variable ranging from 0 to 5). Before running the model, I use the SelectKBest method of scikit-learn to reduce the feature set from 250 to 25. Without an initial feature selection, both Lasso and Ridge yield lower accuracy scores [which might be due to the small sample size, 600]. Also, note that some features are correlated.
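For concreteness, here is a minimal sketch of that setup, with synthetic data standing in for my actual dataset (all parameter values here are illustrative, not my real ones):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 600-sample, 250-feature dataset.
X, y = make_regression(n_samples=600, n_features=250, noise=10.0, random_state=0)

# Univariate selection down to 25 features, as described above.
X_sel = SelectKBest(f_regression, k=25).fit_transform(X, y)

for model in (Lasso(alpha=0.1), Ridge(alpha=1.0)):
    scores = cross_val_score(model, X_sel, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```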
After running the model, I observe that the prediction accuracy is almost the same with Lasso and Ridge. However, when I check the first 10 features after ordering them by the absolute value of their coefficients, I see that there is at most 50% overlap. That is, given that each method assigns different importance to the features, I might arrive at a totally different interpretation based on the model I choose.

Normally, the features represent some aspects of user behavior on a web site. Therefore, I want to explain the findings by highlighting the features (user behaviors) with stronger predictive ability vs. weaker ones. However, I do not know how to move forward at this point. How should I approach interpreting the model? For example, should I combine both and highlight the overlapping features, or should I go with Lasso since it provides more interpretability?
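(For reference, this is roughly how I compute the overlap: rank the features by absolute coefficient under each model and intersect the top-10 sets. The data and hyperparameters below are illustrative:)

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative data in place of the real 25-feature set.
X, y = make_regression(n_samples=600, n_features=25, noise=10.0, random_state=0)

lasso_coef = np.abs(Lasso(alpha=0.1).fit(X, y).coef_)
ridge_coef = np.abs(Ridge(alpha=1.0).fit(X, y).coef_)

# Top 10 features by absolute coefficient under each model.
top_lasso = set(np.argsort(lasso_coef)[-10:])
top_ridge = set(np.argsort(ridge_coef)[-10:])
overlap = len(top_lasso & top_ridge)
print(f"top-10 overlap: {overlap}/10")
```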
Ridge regression encourages all coefficients to become small. Lasso encourages many/most[**] coefficients to become exactly zero, leaving only a few non-zero. Both will reduce accuracy on the training set, but improve prediction in some way:
- ridge regression attempts to improve generalization to the test set by reducing overfitting
- lasso will reduce the number of non-zero coefficients, even if this penalizes performance on both the training and test sets
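You can see both behaviours directly on synthetic data (a sketch; the alpha values are arbitrary, not tuned):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 25 features, only 5 of which are truly informative.
X, y = make_regression(n_samples=600, n_features=25, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso drives the uninformative coefficients exactly to zero;
# ridge only shrinks them towards zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"exact zeros: lasso={n_zero_lasso}, ridge={n_zero_ridge}")
```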
You can get different choices of coefficients if your data is highly correlated. For example, suppose you have 5 features that are correlated:
- by assigning small but non-zero coefficients to all of these features, ridge regression can achieve a low loss on the training set, which might plausibly generalize to the test set
- lasso might choose[*] only a single one of these, one that correlates well with the other four; and there's no reason why it should pick the feature with the highest coefficient in the ridge regression version
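A small demonstration with five near-duplicate features (synthetic; the construction and alpha values are mine, chosen only to make the effect visible):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
z = rng.normal(size=(600, 1))

# Five nearly identical copies of one underlying signal.
X = np.hstack([z + 0.01 * rng.normal(size=(600, 1)) for _ in range(5)])
y = 3.0 * z.ravel() + rng.normal(scale=0.1, size=600)

# Ridge spreads the weight roughly evenly across the correlated copies...
ridge = Ridge(alpha=10.0).fit(X, y)
# ...while lasso concentrates it on one (or a few) of them.
lasso = Lasso(alpha=0.05).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
```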
[*] 'choose' here means: assigns a non-zero coefficient. This is still a bit hand-wavy, since ridge regression coefficients will tend to all be non-zero, but e.g. some might be around 1e-8 while others might be around 0.01.
[**] nuance: as Richard Hardy points out, for some use cases a value of $\lambda$ can be chosen which will result in all LASSO coefficients being non-zero, but with some shrinkage.