Since I'm relatively new to regularized regression, I'm concerned about the huge differences in the results that lasso, ridge, and elastic net deliver.

My data set has the following characteristics:

- panel data set: > 900,000 observations and over 50 variables
- highly unbalanced
- 2-5 variables are highly correlated

To select only a subset of the variables, I used penalized logistic regression, fitting the model:

$\frac{1}{N} \sum_{i=1}^{N} L(\beta, X, y) - \lambda \left[ (1-\alpha) \|\beta\|_2^2 / 2 + \alpha \|\beta\|_1 \right]$

To determine the optimal $\lambda$ I used cross-validation, which yields the following results:
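For reference, a minimal sketch of this kind of comparison in Python with scikit-learn, on synthetic data standing in for the panel data set (the data-generating parameters and variable names here are illustrative assumptions, not taken from the question); `l1_ratio` plays the role of $\alpha$ in the penalty above:

```python
# Hedged sketch: cross-validated penalized logistic regression on
# synthetic data (a stand-in for the real panel data set).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)

# l1_ratio corresponds to alpha: 0 -> ridge, 1 -> lasso, between -> elastic net.
counts = {}
for name, l1_ratio in [("ridge", 0.0), ("elastic net", 0.5), ("lasso", 1.0)]:
    model = LogisticRegressionCV(
        penalty="elasticnet", solver="saga",
        l1_ratios=[l1_ratio], Cs=10, cv=5, max_iter=5000,
    ).fit(X, y)
    counts[name] = int(np.sum(model.coef_ != 0))
    print(name, "keeps", counts[name], "nonzero coefficients")
```

Even on such a toy data set, the lasso typically zeroes out some coefficients while ridge keeps all of them, which mirrors the 2-vs-34 discrepancy described below.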

The elastic net looks quite similar to the lasso, also proposing only 2 variables.

So my main question is: why do these approaches deliver so different results?

According to the lasso, I have only 2 variables in the final model, while according to the ridge I have 34 variables.

So in the end – which approach is the right one?

And why are the results so extremely different?

Thanks a lot!


#### Best Answer

By mean squared error, do you mean the Brier score? And for the elastic net the plot should be three-dimensional, since there are two simultaneous penalty parameters. Don't force $\alpha$ to be 0 or 1.
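In case the terminology is unfamiliar: the Brier score is simply the mean squared error of the predicted probabilities against the 0/1 outcomes. A minimal sketch (the function name and example values are my own, not from the question):

```python
# The Brier score: mean squared difference between predicted
# probabilities and the observed 0/1 outcomes.
import numpy as np

def brier_score(y_true, p_pred):
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

# e.g. brier_score([1, 0, 1], [0.9, 0.2, 0.6]) -> 0.07
```

Lower is better; a model that always predicted the base rate would serve as a natural benchmark.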

To answer your question: the lasso is spending information trying to be parsimonious, while a quadratic penalty is not trying to select features at all but is just trying to predict accurately. It is a fool's errand to expect that a typical problem will result in a parsimonious model that is also highly discriminating. In addition, the lasso is not stable, i.e., if you were to repeat the experiment, the list of selected features would vary quite a lot.
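That instability is easy to demonstrate by refitting the lasso on bootstrap resamples and counting how many distinct feature sets it selects. A hedged sketch on synthetic data with deliberately correlated features (all parameters here are illustrative assumptions; in practice you would substitute your own `X`, `y`):

```python
# Hedged sketch: lasso feature-selection instability under bootstrap
# resampling (synthetic data with correlated/redundant features).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)

selected_sets = []
for _ in range(30):
    idx = rng.integers(0, len(y), len(y))  # bootstrap resample
    m = LogisticRegression(penalty="l1", solver="liblinear",
                           C=0.1, max_iter=2000).fit(X[idx], y[idx])
    selected_sets.append(frozenset(np.flatnonzero(m.coef_)))

# How many distinct "final models" did the lasso produce?
n_distinct = len(set(selected_sets))
print(n_distinct, "distinct feature sets across 30 resamples")
```

With correlated predictors, the lasso tends to pick one representative of a correlated group more or less arbitrarily, so the selected set shifts from resample to resample.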

For optimum prediction use ridge logistic regression. Elastic net is a nice compromise between that and lasso.
