I have a dataset of 240 samples with 12 independent variables. From these 12 variables, I would like to identify the significant variables for prediction. I use Gamma analysis, since the data are highly positively skewed.
Here's what I am doing currently:
1. Split the data set into 70% (training data) and 30% (test data)
2. I use the 70% training data (168 samples) to build a prediction model using Gamma analysis. I run the analysis several times, excluding one variable at a time, to arrive at the best final model.
3. Then I validate the final model using the remaining 30% of the data.
My problem: When should I use k-fold cross-validation? Is it when building the prediction model on the 70% training data, or after I have the final model, applying k-fold CV to the remaining 30% of the data?
Traditionally, model building has three stages, so you need to split your sample into three parts: a training set, a cross-validation set and a test set.
Training (~60%) In training, you simply estimate your model. To avoid overfitting the training set, you don't make any changes to the model based on its results (accuracy, goodness of fit) on the training data.
Cross validation (~20%) After training your model, you can tune it – vary hyperparameters, remove features, or even select between different models – based on its performance on the cross validation set.
As an example, let's say you want to test which variables to include and which to leave out: You specify three different variable combinations (three different models). You train all of them using your training set. Then you evaluate all of them using the cross validation set and select the one that performs best on the CV set.
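The three-way split can be sketched as below. This is a minimal illustration, not a prescribed method: the function name `three_way_split`, the 60/20/20 fractions and the seed are assumptions chosen to match the rough percentages in this answer, and it only shuffles row indices rather than touching any real data.

```python
import random

def three_way_split(n_samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle row indices and cut them into training, CV and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains (~20%)
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(240)
print(len(train_idx), len(val_idx), len(test_idx))
```

Each candidate model would then be fitted on the `train_idx` rows, compared on the `val_idx` rows, and only the selected model would ever be scored on the `test_idx` rows.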
K-fold CV If you are interested in doing k-fold cross-validation, you repeat exactly what's written above, with one major difference: instead of fixing a single 60%/20% split into training and CV sets, you split the data into K folds and run the training and validation procedure K times, each time holding out a different fold for validation and training on the rest. Then you get a set of K results (accuracy, goodness of fit) that you can average to get a more robust estimate of your model's performance.
E.g., if you do 10-fold CV, you'd run it 10 times, each time holding out a different 10% of your data as the cross-validation set and training on the remaining 90%.
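To make the fold mechanics concrete, here is a small sketch using only the standard library. In the usual form of k-fold CV the folds partition the data, so every observation is held out exactly once. The helper name `k_fold_indices` and the seed are assumptions for illustration; 168 is simply 70% of the 240 samples from the question.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        # Training rows are all folds except the one held out.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

for fold_no, (train, val) in enumerate(k_fold_indices(168, k=10), start=1):
    print(fold_no, len(train), len(val))
```

Because 168 is not divisible by 10, the folds differ in size by at most one observation, which is harmless in practice.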
Test set (~20%) After tuning the model and/or selecting the best one, you can test it using the test set. This is data that the model has not seen yet and you shouldn't make any changes to the model based on the test set. This is the very last stage of building the model, only used to evaluate your final model, not to tune it any more (you don't want to overfit your test set).
If doing k-fold CV, you still have to leave out a test set that is separate from your training/CV set you are sampling from.
Putting it all together So in your case, you have $N=240$ observations and $12$ variables. The first split of the data is into training/CV (70-80%) and test (20-30%), which in your case means $168$ to $192$ observations for training/CV and $48$ to $72$ for test. Then, to select the variables to include, do K-fold CV for each model (combination of variables) as follows:
- Split your training/CV set into K equal (random) subsets.
- Estimate your model K times, each time leaving out one of the K subsets.
- Cross-validate each estimate with the subset that was left out.
- Pool your cross-validation results across all the K estimates.
Then pick the model that performs best in CV (on average). Evaluate it on the test set. Don't change it any more.
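The whole procedure can be sketched end to end as follows. Several things here are assumptions made only so the example runs: the data are synthetic, the three candidate variable sets are hypothetical, K is set to 5, and the model is a stand-in (ordinary least squares on $\log y$) rather than a Gamma model. With real data you would swap `fit`, `predict` and `error` for your Gamma analysis and a suitable error measure; the selection logic stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 240 rows, 12 predictors, skewed positive target.
n, p = 240, 12
X = rng.normal(size=(n, p))
mu = np.exp(1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 3])  # only columns 0 and 3 matter
y = rng.gamma(shape=2.0, scale=mu / 2.0)

def fit(X, y):
    """Stand-in model (not a Gamma GLM): least squares on log(y) with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return coef

def predict(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return np.exp(A @ coef)

def error(y_true, y_pred):
    """Mean squared error on the log scale (lower is better)."""
    return float(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

def cv_error(X, y, cols, folds):
    """Average validation error over the folds for one set of columns."""
    scores = []
    for i, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef = fit(X[train][:, cols], y[train])
        scores.append(error(y[val], predict(coef, X[val][:, cols])))
    return float(np.mean(scores))

# 1. Hold out the test set once (30% = 72 rows); never use it for selection.
perm = rng.permutation(n)
test_idx, dev_idx = perm[:72], perm[72:]
X_dev, y_dev = X[dev_idx], y[dev_idx]

# 2. Build the K folds once so every candidate is scored on the same splits.
folds = np.array_split(rng.permutation(len(dev_idx)), 5)

# 3. Score each candidate variable set (hypothetical choices) by K-fold CV.
candidates = {"all 12": list(range(12)), "x0 + x3": [0, 3], "x1 + x2": [1, 2]}
cv_scores = {name: cv_error(X_dev, y_dev, cols, folds)
             for name, cols in candidates.items()}
best = min(cv_scores, key=cv_scores.get)

# 4. Refit the winner on all training/CV rows and evaluate once on the test set.
cols = candidates[best]
coef = fit(X_dev[:, cols], y_dev)
test_error = error(y[test_idx], predict(coef, X[test_idx][:, cols]))
print(cv_scores, best, test_error)
```

Reusing the same folds for every candidate keeps the comparison paired, so differences in CV error reflect the variable sets rather than luck in how the rows were split.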