I have read as many questions as I could on model selection, cross validation, and hyperparameter tuning, and I am still confused about how to partition a dataset for the full training/tuning process.
The scenario: I have 100,000 training instances and I need to pick between 3 competing models (say random forest, ridge, and SVR). I also need to tune the hyperparameters of the selected model. Here is how I think the process should look.
Step 1: Split the data into 80,000 training and 20,000 test sets.
Step 2: Using cross validation, train and evaluate the performance of each model on the 80,000 training set (e.g., using 10-fold CV I would be training on 72,000 and testing against 8,000, 10 times).
Step 3: Use the 20,000 test set to see how well the models generalize to unseen data, and pick a winner (say ridge).
Step 4: Go back to the 80,000 training instances and use cross validation to re-train the model and tune the ridge alpha.
Step 5: Test the tuned model on the 20,000 test set.
Step 6: Train tuned model on full dataset before putting into production.
Is this approach generally correct? I know that this example skimps on technical details, but I am wondering specifically about the partitioning of the dataset for selecting and tuning.
If this is not correct, please provide the steps and numeric splits that you would use in this scenario.
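For concreteness, here is a rough sketch of what I mean by steps 1 and 2 in scikit-learn (assuming `X` and `y` hold the 100,000 instances and their targets; the names and default model settings are only illustrative):

```python
# Sketch of steps 1-2: hold out a test set, then compare candidates with 10-fold CV.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

# Step 1: 80,000 training / 20,000 test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: 10-fold CV on the 80,000 training instances for each candidate model.
for name, model in [("rf", RandomForestRegressor()),
                    ("ridge", Ridge()),
                    ("svr", SVR())]:
    scores = cross_val_score(model, X_train, y_train, cv=10)
    print(name, scores.mean())
```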
Best Answer
I have also been crawling these threads on this topic.
Step 1: Split the data into 80,000 training and 20,000 test sets.
Step 2: Using cross validation, train and evaluate the performance of each model on the 80,000 training set (e.g., using 10-fold CV I would be training on 72,000 and testing against 8,000, 10 times).
Ok up to this point!
Step 3: Use the 20,000 test set to see how well the models generalize to unseen data, and pick a winner (say ridge).
Either do this on a portion of the training set that was not used to tune parameters, or implement nested cross validation in your training set (e.g., use 3/4 of each fold to train and 1/4 to select among RF, logistic regression, etc).
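For the first suggestion, a rough sketch (assuming `X_train`/`y_train` are the 80,000 training instances from step 1 and the three candidates from the question; all names are illustrative):

```python
# Hold back part of the training set purely for model selection,
# so the 20,000-row test set is never touched until the very end.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

# 60,000 rows to fit on, 20,000 rows to compare models on.
X_fit, X_sel, y_fit, y_sel = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

candidates = {"rf": RandomForestRegressor(), "ridge": Ridge(), "svr": SVR()}
selection_scores = {name: model.fit(X_fit, y_fit).score(X_sel, y_sel)
                    for name, model in candidates.items()}
winner = max(selection_scores, key=selection_scores.get)
```

The nested cross validation alternative is spelled out as option B below.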
Step 4: Go back to the 80,000 training instances and use cross validation to re-train the model and tune the ridge alpha.
Step 5: Test the tuned model on the 20,000 test set.
This would not be a valid estimate of the error, as you have already used this data to choose one of the three models (RF, LR, etc.).
Step 6: Train tuned model on full dataset before putting into production.
Tuning the model should be considered a step in the training process.
Say you have 2 models: an RF with param NE = 100 or 200, and an LR with param C = 0.1 or 0.2 (four candidates in total).
You have 2 options (you can mix and match them as long as you adhere to the basic principle: if you use data to make a decision, don't use that same data to evaluate that decision):
A

- Step 1. Split all data into `train_validate` and `test`. Put `test` in a vault.
- Step 2. Split `train_validate` into `train` and `validate`.
- Step 3. Train 2 RF on `train` with param NE = 100 and 200. Train 2 LR on `train` with param C = 0.1 and 0.2. Try all four models on `validate`. Choose the model `model_se` with the smallest error. This is your "modeling process".
- Step 4. Unlock the vault and test `model_se` (as is) on `test` to get some error. This error (one number) will be the expected error on unseen data.

(It appears you have many observations. There is no hard rule for this that I know of, but if your classes are balanced, A might be the most reasonable.)
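A rough sketch of A with the toy grid above (assuming a classification problem with NumPy arrays `X` and `y`; the names are illustrative):

```python
# Option A: one train/validate/test split; the test set stays in the "vault"
# until a single model has been chosen.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Step 1: lock the test set in the vault.
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Step 2: split train_validate into train and validate.
X_tr, X_val, y_tr, y_val = train_test_split(X_tv, y_tv, test_size=0.25, random_state=0)

# Step 3: fit all four candidates on train, compare them on validate.
candidates = [RandomForestClassifier(n_estimators=ne) for ne in (100, 200)] + \
             [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 0.2)]
model_se = max(candidates, key=lambda m: m.fit(X_tr, y_tr).score(X_val, y_val))

# Step 4: unlock the vault; this single number estimates performance on unseen data.
test_accuracy = model_se.score(X_test, y_test)
```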
B

- Convert step 1 into an (outer) loop. If you use 7-fold you will have 7 `train_validate` sets and 7 `test` sets.
- Convert steps 2 and 3 into an (inner) loop. If you use 5-fold you will then 5 times create a `train` on which you train the 4 models and 5 times see which is best on `validate`. Take the model `model_ba` with the best average performance over the inner folds.
- Test `model_ba` on the `test` set (in the outer fold) each time (each one will be a different model). Since within each outer loop you have an estimate of the error, you will end up with 7 error estimates. The average of these errors is `E` and their variance is `V`.
- Rerun the modeling process (steps 2 and 3) from scratch on the entire dataset, i.e., take 100% of the data and run steps 2 and 3 (using the same train:validate split ratio or 5-fold CV that you used there). This will return some model `M`. You can expect performance `E` from model `M` on unseen data. The variance `V` unfortunately cannot be used to construct a 95% confidence interval (Bengio, 2004).
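Written out with explicit loops, B looks roughly like this (same toy grid of four classifiers; `X` and `y` are assumed to be NumPy arrays, and the fold counts match the example above):

```python
# Option B: 7 outer folds estimate the error of the whole modeling process;
# 5 inner folds do the model selection within each outer train_validate set.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def candidates():
    return [RandomForestClassifier(n_estimators=ne) for ne in (100, 200)] + \
           [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 0.2)]

outer_errors = []
for tv_idx, test_idx in KFold(n_splits=7, shuffle=True, random_state=0).split(X):
    X_tv, y_tv = X[tv_idx], y[tv_idx]        # train_validate for this outer fold
    X_te, y_te = X[test_idx], y[test_idx]    # test for this outer fold

    # Inner loop: average validation accuracy of each candidate over 5 folds.
    inner_scores = np.zeros(4)
    for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_tv):
        for i, model in enumerate(candidates()):
            model.fit(X_tv[tr_idx], y_tv[tr_idx])
            inner_scores[i] += model.score(X_tv[val_idx], y_tv[val_idx]) / 5

    # Refit the inner winner (model_ba) on all of train_validate, test on the outer fold.
    model_ba = candidates()[int(np.argmax(inner_scores))].fit(X_tv, y_tv)
    outer_errors.append(1 - model_ba.score(X_te, y_te))

E, V = np.mean(outer_errors), np.var(outer_errors)  # expected error and its variance
```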
B is also known as 'nested cross validation', but it is really just plain cross validation of an entire modeling process, one that involves tuning both parameters and hyperparameters (or, equivalently, treating the hyperparameters as parameters and just tuning parameters; see here). If you choose B, it is worth running multiple iterations of it to see the variance of the entire process.
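If the modeling process can be written as one estimator plus a hyperparameter grid (e.g., the ridge/alpha tuning from the question), this "cross validation of an entire modeling process" view becomes very compact in scikit-learn: a GridSearchCV is the inner process and cross_val_score is the outer loop (a sketch; the alpha grid is illustrative):

```python
# Nested CV as "plain CV of a tuning procedure": the inner GridSearchCV is the
# modeling process, and the outer cross_val_score evaluates that whole process.
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import Ridge

inner = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                     cv=KFold(n_splits=5, shuffle=True, random_state=0))
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=7, shuffle=True, random_state=1))

# Final model for production: rerun the same modeling process on all of the data.
final_model = inner.fit(X, y).best_estimator_
```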
Other methods such as the bootstrap may be preferable to cross validation; I have not had time to work out the details of why this is so.
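For reference, a minimal sketch of one such resampling alternative, an out-of-bag bootstrap estimate for a fixed modeling choice (the ridge alpha here is illustrative, and in practice the tuning step would also belong inside the resampling loop):

```python
# Bootstrap (out-of-bag) performance estimate: refit on resampled rows,
# score on the rows left out of each resample.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = len(X)
oob_scores = []
for _ in range(200):                         # 200 bootstrap resamples
    boot = rng.integers(0, n, size=n)        # sample row indices with replacement
    oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag rows act as a test set
    model = Ridge(alpha=1.0).fit(X[boot], y[boot])
    oob_scores.append(model.score(X[oob], y[oob]))
print(np.mean(oob_scores))
```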