Suppose you get to observe "matches" between buyers and sellers in a market. You also get to observe characteristics of both buyers and sellers which you would like to use to predict future matches & make recommendations to both sides of the market.
For simplicity, assume there are N buyers and N sellers and that each finds a match. There are then N matched pairs and N(N-1) non-matched pairs, so the all-inclusive training dataset has N + N(N-1) = N^2 observations, which can be prohibitively large (with N = 10,000, that is 100 million rows). It would seem that randomly sampling from the N(N-1) non-matches and training an algorithm on that reduced dataset could be more efficient. My questions are:
(1) Is sampling from the non-matches to build a training dataset a reasonable way to deal with this problem?
(2) If so, is there a rigorous way to decide how large a sample of the N(N-1) non-matches to include?
Best Answer
If I understand correctly, you have a two-class classification problem in which the positive class (matches) is rare. Many classifiers struggle with such a class imbalance, and it is common practice to sub-sample the majority class in order to obtain better performance, so the answer to your first question is "yes". However, if you sub-sample too much, you will end up with a classifier that over-predicts the rare positive class. The best approach is therefore to choose the sub-sampling ratio that maximises performance, for example by minimising the cross-validation error, where the validation data has *not* been sub-sampled, so that it gives a good indication of operational performance.
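A minimal sketch of that ratio selection, assuming scikit-learn, NumPy arrays `X` (features of each buyer-seller pair) and `y` (1 = match, 0 = non-match), and logistic regression as a placeholder classifier:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def subsample_majority(X, y, ratio, rng):
    """Keep all positives and a random `ratio` fraction of negatives."""
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    keep = rng.choice(neg, size=max(1, int(ratio * len(neg))), replace=False)
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]

def cv_score_for_ratio(X, y, ratio, n_splits=5, seed=0):
    rng = np.random.default_rng(seed)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in cv.split(X, y):
        # Sub-sample only the training fold...
        X_tr, y_tr = subsample_majority(X[train_idx], y[train_idx], ratio, rng)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        # ...but score on the untouched validation fold, so the estimate
        # reflects the operational class frequencies.
        scores.append(log_loss(y[val_idx], clf.predict_proba(X[val_idx])[:, 1]))
    return np.mean(scores)

# Pick the ratio with the lowest cross-validated loss, e.g.:
# ratios = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0]
# best = min(ratios, key=lambda r: cv_score_for_ratio(X, y, r))
```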
If you have a probabilistic classifier, one that estimates the probability of class membership, you can go one better and post-process the output to compensate for the difference between the class frequencies in the training set and those in operation. I suspect that for some classifiers the optimal approach is to tune both the sub-sampling ratio and the correction to the output against the cross-validation error.
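Concretely, sub-sampling shifts the class priors but not the class-conditional densities, so the correction re-weights the posterior odds by the ratio of operational to training-set priors. A minimal sketch (the function and variable names are my own; only the formula is the standard Bayes prior-shift adjustment):

```python
def correct_prior_shift(p_train, pi_train, pi_true):
    """Map P_train(match | x) to P_true(match | x).

    p_train  -- probability output by the model trained on sub-sampled data
    pi_train -- fraction of matches in the sub-sampled training set
    pi_true  -- fraction of matches in operation (here N / N^2 = 1 / N)
    """
    num = p_train * (pi_true / pi_train)
    den = num + (1.0 - p_train) * ((1.0 - pi_true) / (1.0 - pi_train))
    return num / den
```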
Rather than sub-sampling, some classifiers (e.g. SVMs) let you give different weights to positive and negative patterns. I prefer this to sub-sampling because there is then no variability in the results due to the particular sub-sample used. Where this is not possible, use bootstrapping to build a bagged classifier, with a different sub-sample of the majority class in each iteration.
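A minimal sketch of both options, assuming scikit-learn (`class_weight` is a genuine parameter of `sklearn.svm.SVC`); the weights and the number of bagging members below are illustrative placeholders:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Option 1: re-weight instead of sub-sampling -- no sampling variability.
# 'balanced' sets class weights inversely proportional to class frequencies;
# an explicit dict such as {0: 1, 1: 50} also works.
svm = SVC(kernel="rbf", class_weight="balanced", probability=True)

# Option 2: bagging over different majority sub-samples, for classifiers
# without per-class weights. Each member sees all positives plus a fresh
# random draw of negatives; predictions are averaged over members.
def bagged_fit(X, y, n_members=10, neg_per_pos=5, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    members = []
    for _ in range(n_members):
        draw = rng.choice(neg, size=min(len(neg), neg_per_pos * len(pos)),
                          replace=False)
        idx = np.concatenate([pos, draw])
        members.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return members

def bagged_predict_proba(members, X_new):
    return np.mean([m.predict_proba(X_new)[:, 1] for m in members], axis=0)
```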
The one other thing I would say is that where there is a large class imbalance, false negative and false positive errors are commonly not equally costly, and it is a good idea to build this into the classifier design (which can be accomplished via the sub-sampling ratio or via the weights given to patterns of each class).
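For a probabilistic classifier this amounts to moving the decision threshold. A minimal sketch of the standard cost-sensitive decision rule (the costs shown are placeholders, not values from the answer):

```python
def predict_with_costs(p_match, cost_fp=1.0, cost_fn=20.0):
    """Predict 'match' when the expected cost of saying 'no match'
    exceeds the expected cost of saying 'match':
        p * cost_fn > (1 - p) * cost_fp
    which rearranges to the threshold below."""
    threshold = cost_fp / (cost_fp + cost_fn)
    return p_match > threshold
```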