I have a rather small dataset of 4,000 points (140 features each) to feed to an NN binary classifier. The problem is that only ~700 of them belong to the second class. Is it more common to resample the whole dataset and then split, or to split first and then resample? What are the advantages and disadvantages of each?
My current approach tries to get close to a 1:1 ratio (2:1 seemed to work almost as well):
- Duplicate the elements of the second class up to 4×
- Shuffle the data and split it into an 80:20 training/testing split
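To make the concern concrete, here is a minimal sketch of the oversample-then-split procedure above on toy random data with the same shape and class ratio (not the actual dataset), counting how many test rows also occur verbatim in the training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 140))
y = (rng.random(4000) < 0.175).astype(int)  # roughly 700 minority points

# Oversample the WHOLE set first: add 3 extra copies of each minority row (4x total).
minority = X[y == 1]
X_over = np.vstack([X, np.repeat(minority, 3, axis=0)])
y_over = np.concatenate([y, np.ones(3 * minority.shape[0], dtype=int)])

# Then shuffle and split 80:20, as in the question.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_over, y_over, test_size=0.2, random_state=0, shuffle=True)

# Count test rows that are byte-identical to some training row.
train_rows = {row.tobytes() for row in X_tr}
leaked = sum(row.tobytes() in train_rows for row in X_te)
print(f"{leaked} of {len(X_te)} test rows also occur in the training set")
```

Because each minority row exists four times, almost every minority row in the test set also has a copy in the training set, which is exactly the leakage worried about below.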
I got what seem to be good results, but I suspect that having the same data points appear in both sets skews the evaluation. These are the predictions of the different methods:
[Plot: predictions after 2:1 oversampling of the whole set]
[Plot: predictions after 4× oversampling of the training set only]
Of course, this problem would not exist with a larger amount of data, but since that is not the case: does oversampling the whole dataset produce results that are invalid for evaluation? Or is it even the preferred method when working with a small sample?
Any insight would be greatly appreciated!
This question seems to address a similar problem and says that oversampling the whole dataset is a bad idea. However, it makes no mention of any specific classifier. In addition, my data is not severely imbalanced, but it is a small sample.
Edit II: ADASYN
I used the ADASYN algorithm to produce synthetic samples. Oversampling the whole set produced a more accurate result, while oversampling only the training set was inconclusive: the accuracy is worse, but the predictions themselves look better.
Your test set should be as close as possible to a sample from the distribution on which you are actually going to apply your classifier. I would definitely split your dataset first (in fact, that is usually the first thing I do after obtaining a dataset), set the test set aside, and then do everything you want to do on the training set only. Otherwise, it is very easy for biases to creep in.
The same applies to your validation sets, e.g. if you are using cross-validation. What you really want is an estimate of how well your approach would work out-of-sample, so that you can select the best approach. The best way to get that is to evaluate your approach on actual, unmodified out-of-sample data.
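The recommended workflow can be sketched as follows (toy data and a logistic-regression stand-in for the NN, both assumptions for illustration): split first, oversample the training set only, and score on the untouched test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Toy imbalanced data: 660 majority vs 140 minority points.
X = np.vstack([rng.normal(0.0, 1.0, size=(660, 10)),
               rng.normal(0.8, 1.0, size=(140, 10))])
y = np.array([0] * 660 + [1] * 140)

# 1) Split FIRST, so the test set remains an unmodified sample.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# 2) Oversample the minority class in the TRAINING set only.
pos_idx = np.flatnonzero(y_tr == 1)
extra = rng.choice(pos_idx, size=3 * pos_idx.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

# 3) Train on the balanced training data, evaluate on the untouched test set.
clf = LogisticRegression().fit(X_bal, y_bal)
print("test accuracy:", clf.score(X_te, y_te))
```

No test row can leak into training here, because duplication happens strictly after the split.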