If the variance of the test dataset is lower than that of the training dataset, is it worth splitting the data? Since we know our dataset will always be limited, is it fair to select models under the above condition? Thanks
You first have to figure out why you are splitting the data. The only reason that comes immediately to mind is that fitting the model is so laborious that you can only do it once. Otherwise, resampling methods are far better, starting with the Efron-Gong optimism bootstrap (see e.g. the R rms package) or 10-fold cross-validation repeated 100 times.
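To make the optimism bootstrap concrete, here is a rough Python sketch using only the standard library. It assumes a one-predictor least-squares line and mean squared error as the performance measure. The helper names (`fit_line`, `mse`, `optimism_bootstrap`) and the toy data are made up for illustration, and this is not how the rms package implements it. The idea: fit on each bootstrap resample, measure how much worse that fit does on the original data than on its own resample, and add the average of that gap to the apparent error.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: every x is identical
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    a, b = model
    return fmean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def optimism_bootstrap(xs, ys, n_boot=200, seed=0):
    """Return (apparent error, mean optimism, optimism-corrected error)."""
    rng = random.Random(seed)
    n = len(xs)
    # Apparent error: fit and evaluate on the same, full dataset.
    apparent = mse(fit_line(xs, ys), xs, ys)
    optimism = []
    for _ in range(n_boot):
        # Resample n cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Optimism: error on the original data minus error on the resample.
        optimism.append(mse(model, xs, ys) - mse(model, bx, by))
    mean_opt = fmean(optimism)
    return apparent, mean_opt, apparent + mean_opt


# Toy data (hypothetical): y = 1 + 2x plus Gaussian noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(40)]
ys = [1.0 + 2.0 * x + data_rng.gauss(0, 1.5) for x in xs]

apparent, mean_opt, corrected = optimism_bootstrap(xs, ys)
print(f"apparent MSE:  {apparent:.3f}")
print(f"mean optimism: {mean_opt:.3f}")
print(f"corrected MSE: {corrected:.3f}")
```

Every observation is used both for fitting and for validation here, which is why this kind of resampling tends to be preferred over a single split when the dataset is small. For a real model, every data-driven step (variable selection, tuning) would have to be repeated inside the loop.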
- Solved – Creating dummy variables before or after splitting to train/test datasets
- Solved – Cross validation and train test split
- Solved – Predicting with cross validation
- Solved – Out-of-bag error and error on test dataset for random forest
- Solved – do hyper-parameters optimization before model selection