Consequences of overlap between training, validation and test data

I'm splitting my data into training, validation and test sets to assess different (sub)sets of features for my task.

What are the consequences if I (by mistake) split my data incorrectly? Consider the following cases:

  1. part of the training and validation data overlap
  2. part of the validation and test data overlap
  3. part of the test and training data overlap

In case 1, the wrong classifier could be selected in the model-selection step, since a classifier trained on the overlapping part of the data has a higher chance of being chosen.

In case 2, the test step will rate the classifier better than it is, since the classifier was chosen based on part of the test data.

In case 3, the test step will rate the classifier better than it is, since the classifier was trained on part of the test data.
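Case 3 is easy to simulate. Below is a minimal pure-Python sketch (the toy data and the deliberately overfitting "memorizing" classifier are both invented for illustration): the labels are pure noise, so nothing can truly be learned, yet letting test samples leak into the training set makes the measured test accuracy look much better than chance.

```python
import random

rng = random.Random(42)

# Toy data: integer inputs with random binary labels (nothing to learn).
data = [(i, rng.randint(0, 1)) for i in range(200)]

train, test = data[:100], data[100:]
# The mistake: half of the "test" set is actually training data.
leaky_test = test[:50] + train[:50]

# A memorizing "classifier": returns the stored label for inputs it has
# seen during training, otherwise falls back to predicting class 0.
memory = dict(train)
predict = lambda x: memory.get(x, 0)

def accuracy(split):
    return sum(predict(x) == y for x, y in split) / len(split)

clean_score = accuracy(test)        # chance level: labels are random noise
leaky_score = accuracy(leaky_test)  # inflated: the overlapping half is
                                    # predicted perfectly from memory
```

Here `leaky_score` comes out clearly above `clean_score`, even though the classifier has no real predictive power, which is exactly the overly optimistic estimate described above.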

Is my reasoning correct? Could I add something to it?

Yes, your reasoning seems fine. Let me state it in slightly different words:

  • Training and cross-validation overlap: the model has seen data during training that is also used in CV, so the CV results will be overly optimistic. Hence a wrong model could be chosen from CV, and its estimated performance might be too high.

  • Cross-validation and test overlap: the test results will be overly optimistic, as the model performing best in CV was selected to be evaluated on the test data. Hence the final (e.g. reported) model performance might be too high.

  • Training and test overlap: the test results will be overly optimistic, as the model already saw the same data during training; again, the final (e.g. reported) model performance might be too high.
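All three mistakes can be ruled out at once by cutting a single shuffled index list into three pieces and asserting disjointness. A minimal sketch (the function name and split fractions are my own choices for illustration):

```python
import random

def split_indices(n, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into disjoint
    train/validation/test index sets."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(n * frac_train)
    n_val = int(n * frac_val)
    train = set(idx[:n_train])
    val = set(idx[n_train:n_train + n_val])
    test = set(idx[n_train + n_val:])
    return train, val, test

train, val, test = split_indices(100)

# Guard against the three overlap mistakes discussed above:
assert train.isdisjoint(val), "training/validation overlap"
assert val.isdisjoint(test), "validation/test overlap"
assert train.isdisjoint(test), "training/test overlap"
assert len(train | val | test) == 100  # every sample used exactly once
```

Because the cuts are taken from one shuffled list, the three sets are disjoint by construction; the assertions are cheap and catch accidental re-use if the splitting code is ever changed. Note that duplicates in the underlying data (identical rows with different indices) can still leak information even when the index sets are disjoint.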
