Solved – Data standardization for training and testing sets for different scenarios

I am aware this has been discussed here. But I would like to ask a bit more about the topic under different scenarios.

  1. Standardizing the whole dataset before splitting. In general this is considered wrong because of the information leakage from training to testing. But what if I have a sufficiently large dataset and all data points are well mixed, so that the training and testing sets are drawn from the same stable distribution? How realistic would this assumption be?

  2. Standardizing training and testing sets separately. Would the assumption in 1. render this way of standardizing sensible?

  3. Using the stats of the training set to standardize the testing set. This is recommended everywhere. The potential drawback of this approach would be that, if the entire dataset is small, the training-set stats might produce abnormally large or small standardized values in the testing data.

  4. Using potential population stats to standardize the data. For example, if I am using human proteins to do some kind of classification, the potential population could be all proteins of living systems, or of all mammals. The big problem here is that even all natural proteins combined are still tiny compared to the universal combinatorial space of sequences built from the 20 amino acids. My question is how reliable it is to use the potential population in this case.

Many thanks in advance!

I hope this helps:

  1. This is wrong in general because of info leakage, as you say. If the data truly comes from the same distribution (although this only really happens with simulated data) and you have a ton of data, there isn't even much point in having a test dataset. I have to emphasize that this will only happen in contrived examples…

  2. This is not good, because your model is really the training set's stats plus the fitted model (e.g. a regression). If you standardize the test set with its own stats, your predictions will depend on the test set: with 10 or 20 testing examples you'll get different predictions depending on which ones are in the batch. You also can't predict with a single example at all, because you cannot compute a stddev from one point (a small demonstration follows the list).

  3. Do this. Whatever problems you have, you'll be worse off using the other methods. If your dataset is truly so small that you can't reliably compute the means and stddevs of the variables, it is unrealistic to build a model anyway (see the sketch after this list).

  4. This could work as long as two things hold: (a) the test set is not part of your large population, and (b) you always use the same values to standardize (so your model is one fixed entity that doesn't change with the test data, just as in (3)).
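To make point 2 concrete, here is a tiny NumPy demonstration of why standardizing the test set with its own stats is problematic. The data and shapes are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 1))

def standardize_with_own_stats(X):
    # Standardize a batch using its *own* mean and stddev (the flawed approach in point 2).
    return (X - X.mean(axis=0)) / X.std(axis=0)

# The same example gets a different standardized value depending on
# which other test examples happen to be in the batch with it.
full_batch = standardize_with_own_stats(X_test)
half_batch = standardize_with_own_stats(X_test[:10])
print(full_batch[0], half_batch[0])   # not equal

# And a single example cannot be standardized at all:
# its stddev is 0, so the division yields nan.
single = standardize_with_own_stats(X_test[:1])
print(single)
```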
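And here is a minimal sketch of the recommended approach from point 3, assuming scikit-learn is available; the toy data stands in for whatever features you actually have:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training data only...
scaler = StandardScaler().fit(X_train)

# ...then apply those same training-set means and stddevs to both splits.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# This also works for a single new example, which separate
# standardization (point 2) cannot handle at all.
x_new = rng.normal(loc=5.0, scale=2.0, size=(1, 3))
x_new_std = scaler.transform(x_new)
```

The key design point is that the training-set means and stddevs become part of the model, so they are frozen at training time and reused unchanged at prediction time, which is exactly condition (b) in point 4.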

I hope this sheds some light on your questions!
