Say I have a feature matrix $X$ and a target $y$. I use $k$-fold cross validation to generate $k$ out-of-sample MSE curves as a function of a penalty parameter $\lambda$:
$$MSE_i(\lambda) \quad (i=1,\dots,k)$$
Given these curves, how should I choose $\lambda$? Two approaches I have seen are:

1. Choose $\lambda=\lambda^*$ to minimize the average OOS mean square error.
2. Choose the largest $\lambda$ whose average OOS mean square error is within one standard error (taken over the $k$ cross validation sets) of the minimum.
But it seems that 1. is too optimistic (I am likely to choose an overly complex model) and 2. is too pessimistic (there is a lot of correlation between the values of $MSE_i(\lambda)$ at different points along the curve, so one standard error is too large a margin).
Is there a happy medium, or a 'best' approach?
> Choose $\lambda=\lambda^*$ to minimize the average OOS mean square error.
This strategy assumes you have enough independent test cases so the error on your OOS estimate is negligible.
You are right: if the error on the OOS measurements is not negligible, this can cause a bias towards too complex models. The reason is that if you compare
- many models of varying complexity
- that have essentially the same performance (i.e. you cannot distinguish their performance with the given validation set-up, particularly the given total no. of test cases),
- with a performance measurement that is subject to substantial variance,
you may "skim" the variance: the best observed performance may be caused by an (accidentally) favorable split of training and test sets rather than actually better generalization performance of the model.
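This skimming effect is easy to reproduce in a toy simulation (a numpy sketch; the number of models and the noise level are made-up values): score many models whose true performance is identical, each with a noisy performance estimate, and the best *observed* score is systematically better than the true performance.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 50    # many candidate models with identical true performance
n_reps = 10_000  # repeat the selection experiment many times
true_mse = 1.0   # every model truly has OOS MSE = 1
est_se = 0.1     # noise on each model's estimated OOS MSE

# Observed OOS MSE for each model: truth + estimation noise
observed = true_mse + est_se * rng.normal(size=(n_reps, n_models))

# Selecting the model with the best observed MSE "skims" the noise:
# the expected best-of-50 estimate lies well below the true MSE of 1.0
best_observed = observed.min(axis=1).mean()
print(best_observed)
```

Even though no model is actually better than any other, the winner's reported MSE is optimistically biased by roughly two estimation standard errors here.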
The next weaker assumption is that there is some non-negligible error on the OOS estimate, but essentially the individual OOS measurements (for each surrogate model) still behave independently of each other:
> Choose the largest $\lambda$ whose average OOS mean square error is within one standard error (taken over the $k$ cross validation sets) of the minimum.
Otherwise, you need to take into account that you actually have only slightly varying models (only a few training cases are exchanged between any two of the surrogate models) and only a finite number of distinct test cases. This means that the usual standard error calculation would overestimate the effective number of measurements ($n$) and thus underestimate the standard error.
In consequence, in this situation you should select an even less complex model.
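For concreteness, both selection rules can be sketched in a few lines (a numpy sketch under assumed shapes: `mse` is a $k \times m$ matrix of fold-wise MSE values over an increasing grid of $m$ penalty values, larger $\lambda$ meaning a less complex model, as in ridge/lasso):

```python
import numpy as np

def select_lambda(mse, lambdas):
    """Apply both selection rules to a (k, m) matrix of CV MSE curves.

    mse[i, j] is the out-of-sample MSE of fold i at penalty lambdas[j];
    lambdas is assumed sorted in increasing order.
    """
    k = mse.shape[0]
    mean_mse = mse.mean(axis=0)                # average OOS MSE per lambda
    se = mse.std(axis=0, ddof=1) / np.sqrt(k)  # SE over the k folds
    # (this SE treats the folds as independent, which they are not;
    # with shared training cases it tends to be an underestimate)

    # Rule 1: lambda minimizing the average OOS MSE
    j_min = np.argmin(mean_mse)

    # Rule 2 (one-standard-error rule): largest lambda whose mean MSE
    # is within one SE of the minimum
    threshold = mean_mse[j_min] + se[j_min]
    j_1se = np.where(mean_mse <= threshold)[0].max()

    return lambdas[j_min], lambdas[j_1se]
```

By construction, the one-standard-error rule always returns a $\lambda$ at least as large (a model at most as complex) as the straight minimizer.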