Suppose the event of interest occurs in approximately 10% of cases, and there are around 5,000 cases in total. Should you use a penalized logistic regression for this, or is regular logistic regression okay? In other words, what qualifies something as a rare event?
Best Answer
Don't do anything special.
However, and this is crucial: choose a good quality measure. And that is not classification accuracy, sensitivity, specificity or similar measures, such as ROC curves. These can be very misleading for unbalanced data, because they can "identify" simply labeling everything as the majority class as "optimal". It isn't.
Oversampling the minority class or undersampling the majority class won't solve this problem, because it amounts to biasing your model and pretending that the population is different than it truly is. Neither will collecting more data solve your problem, since the relation between majority and minority classes won't change.
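To make the point about resampling concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical data set of 5,000 cases with exactly 500 events (10%), and the helper `undersample_majority` is made up for illustration. It uses the fact that an intercept-only model predicts the event rate of whatever sample it was fitted on, so the rate in the resampled data is what such a model would report as "the" probability.

```python
import random


def undersample_majority(labels, rng):
    """Keep every event (1) and an equally sized random subset of non-events (0).

    Hypothetical helper for illustration only.
    """
    events = [y for y in labels if y == 1]
    non_events = [y for y in labels if y == 0]
    kept_non_events = rng.sample(non_events, len(events))
    return events + kept_non_events


rng = random.Random(0)

# Assumed population: 5,000 cases, 500 of them events (10%).
labels = [1] * 500 + [0] * 4500

# Event rate an intercept-only model would learn from the full data.
observed_rate = sum(labels) / len(labels)

# Event rate the same model would learn after undersampling to 1:1.
balanced = undersample_majority(labels, rng)
balanced_rate = sum(balanced) / len(balanced)

print(f"event rate in the original data:    {observed_rate:.2f}")
print(f"event rate after undersampling:     {balanced_rate:.2f}")
```

The resampled data describe a population in which half of all cases are events, so any probability learned from them is inflated relative to the real 10% base rate unless it is corrected afterwards.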
Instead, use probabilistic models rather than hard-thresholded 0-1 classification, and then evaluate them with proper scoring rules. ("Proper" is really part of the term. There are proper and improper scoring rules. Classification accuracy is an improper scoring rule, and that is why it is not useful.)
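As a rough illustration of the difference, the sketch below compares accuracy with two common proper scoring rules, the Brier score and the log loss, on made-up data. Everything in it is an assumption for the sake of the example: 5,000 cases split into a low-risk group (4,000 cases, 5% events) and a high-risk group (1,000 cases, 30% events), for a 10% overall event rate, and three hypothetical forecasters. The functions are written out by hand in plain Python rather than taken from a library.

```python
import math


def accuracy(y_true, p_pred, threshold=0.5):
    """Share of cases where the thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Assumed data: low-risk group (200 events in 4,000), high-risk group (300 in 1,000).
y_low = [1] * 200 + [0] * 3800
y_high = [1] * 300 + [0] * 700
y_true = y_low + y_high

forecasts = {
    # Hard classifier that labels everything as the majority class.
    "always 'no event'": [0.0] * len(y_true),
    # Probabilistic forecast that only knows the overall base rate.
    "base rate 0.10": [0.10] * len(y_true),
    # Probabilistic forecast that knows each group's true event rate.
    "group rates": [0.05] * len(y_low) + [0.30] * len(y_high),
}

results = {
    name: (accuracy(y_true, p), brier_score(y_true, p), log_loss(y_true, p))
    for name, p in forecasts.items()
}

for name, (acc, brier, ll) in results.items():
    print(f"{name:20s} accuracy={acc:.3f}  Brier={brier:.3f}  log loss={ll:.3f}")
```

Under these assumptions all three forecasters get the same accuracy, because none of them ever predicts a probability above 0.5. The Brier score and the log loss (lower is better for both) rank the group-aware forecast first and the "always majority class" classifier last.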
Frank Harrell, who knows what he is talking about, has written extensively on the topic.