Intro:
I have a dataset with a classical "large p, small n" problem: the number of available samples is $n = 150$, while the number of possible predictors is $p = 400$. The outcome is a continuous variable.
I want to find the most "important" descriptors, i.e., those that are best candidates for explaining the outcome and helping to build a theory.
After researching this topic, I found that LASSO and Elastic Net are commonly used for the large $p$, small $n$ case. Some of my predictors are highly correlated, and I want to preserve their groupings in the importance assessment, so I opted for Elastic Net. I suppose I can use the absolute values of the regression coefficients as a measure of importance (please correct me if I am wrong; my dataset is standardized).
Problem:
As my number of samples is small, how can I achieve a stable model?
My current approach is to find the best tuning parameters (lambda and alpha) by grid search on 90% of the dataset, using 10-fold cross-validation with the average MSE as the criterion. Then I train the model with the best tuning parameters on that whole 90% of the dataset. I can evaluate the model using R squared on the 10% holdout (which amounts to only 15 samples).
Running this procedure repeatedly, I found a large variance in the R squared assessments. The number of non-zero predictors also varies, as do their coefficients.
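In code, my procedure looks roughly like the following sketch (scikit-learn's ElasticNetCV is assumed, and X and y are synthetic placeholders for my standardized data; note that scikit-learn calls the penalty strength alpha and the L1/L2 mix l1_ratio, i.e., glmnet's lambda and alpha):

```python
# Rough sketch of the repeated split / tune / evaluate procedure described above.
# X and y below are synthetic placeholders for the real standardized data.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 400))                       # placeholder: n = 150, p = 400
y = X[:, :5] @ rng.standard_normal(5) + rng.standard_normal(150)

r2_scores, n_selected = [], []
for seed in range(20):                                    # repeat the whole procedure
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=seed)
    # 10-fold CV grid over the penalty strength and the L1/L2 mixing parameter
    model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
                         n_alphas=100, cv=10, max_iter=10000)
    model.fit(X_tr, y_tr)                                 # refit on the full 90% with best parameters
    r2_scores.append(r2_score(y_te, model.predict(X_te)))  # R^2 on the 15-sample holdout
    n_selected.append(int(np.sum(model.coef_ != 0)))

print("holdout R^2: mean %.2f, sd %.2f" % (np.mean(r2_scores), np.std(r2_scores)))
print("non-zero predictors per run:", n_selected)
```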
How can I get a more stable assessment of predictor importance and a more stable assessment of the final model's performance?
Can I repeatedly run my procedure to create a number of models and then average the regression coefficients? Or should I use the number of occurrences of a predictor across the models as its importance score?
Currently, I get around 40-50 non-zero predictors. Should I penalize the number of predictors harder for better stability?
Best Answer
"Sparse Algorithms are not Stable: A No-free-lunch Theorem"
I guess the title says a lot, as you pointed out.
[…] a sparse algorithm can have non-unique optimal solutions, and is therefore ill-posed
Check out randomized lasso, and the talk by Peter Buhlmann.
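To make that concrete, here is a minimal sketch of the selection-frequency idea behind stability selection: refit the lasso on many random subsamples and rank predictors by how often they receive a non-zero coefficient. scikit-learn is assumed, and a single fixed penalty is used for simplicity (the actual method aggregates over a grid of penalties and, in the randomized variant, also perturbs the penalty weights):

```python
# Minimal stability-selection-style sketch (scikit-learn assumed; X, y are your data).
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha=0.05, n_subsamples=200, frac=0.5, seed=0):
    """Fraction of subsamples in which each predictor gets a non-zero coefficient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)   # subsample half the data
        fit = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx])
        counts += (fit.coef_ != 0)
    return counts / n_subsamples

# Predictors whose frequency exceeds a threshold (often 0.6-0.9) are treated as "stable":
# freqs = selection_frequencies(X, y)
# stable = np.flatnonzero(freqs >= 0.8)
```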
Update:
I found the following paper easier to follow than "Stability Selection" by Meinshausen and Buhlmann.
In "Random Lasso", the authors consider the two important drawbacks of the lasso for large $p$, small $n$ problems, that is,
- When several highly correlated variables exist, the lasso picks only one or a few of them, which leads to the instability you describe
- The lasso cannot select more variables than the sample size $n$, which is a problem for many models
The main idea of random lasso, which is able to deal with both drawbacks, is the following:
If several independent data sets were generated from the same distribution, then we would expect lasso to select nonidentical subsets of those highly correlated important variables from different data sets, and our final collection may be most, or perhaps even all, of those highly correlated important variables by taking a union of selected variables from different data sets. Such a process may yield more than $n$ variables, overcoming the other limitation of lasso.
Bootstrap samples are drawn to simulate multiple data sets. The final coefficients are obtained by averaging the results over the bootstrap samples.
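Here is a simplified sketch of that bootstrap-and-average step, assuming scikit-learn (it leaves out random lasso's second stage, where predictors are re-drawn with probability proportional to their importance from the first stage):

```python
# Simplified random-lasso-style sketch (scikit-learn assumed; X, y are your data):
# on each bootstrap sample, fit the lasso to a random subset of q predictors,
# then average the coefficients over all bootstrap samples (zero when not drawn).
import numpy as np
from sklearn.linear_model import LassoCV

def bootstrap_lasso_coefficients(X, y, n_bootstrap=200, q=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coef_sum = np.zeros(p)
    for _ in range(n_bootstrap):
        rows = rng.choice(n, size=n, replace=True)         # bootstrap the observations
        cols = rng.choice(p, size=q, replace=False)        # random subset of predictors
        fit = LassoCV(cv=5, max_iter=10000).fit(X[rows][:, cols], y[rows])
        coef_sum[cols] += fit.coef_
    return coef_sum / n_bootstrap                          # averaged coefficients

# The absolute values of the averaged coefficients can serve as importance scores:
# importances = np.abs(bootstrap_lasso_coefficients(X, y))
```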
It would be great if somebody could elaborate on and explain this algorithm further in the answers.