I am working on a text classification project in which we have hundreds of (imbalanced) classes. Some characteristics of the data:
- We have examples of "bad" documents: documents that don't fit into any other class. We may remove those.
- The documents are small (< 100 characters).
- The documents are very similar within the same class, but very different between different classes. The only exception is the "bad" class, which contains random documents with a very diverse vocabulary.
- The most frequent class has around 30k observations (it is the "bad" class), while others can have fewer than a hundred; most are in the thousands.
- These class frequencies refer to the whole dataset (330k observations), which is not labelled; we estimated the frequencies with clustering.
We proceeded by sampling observations from each cluster and labelling them, with sample sizes proportional to the cluster sizes. This resulted in 133 classes, where the most frequent class has 3k observations and the smallest has 10.
This led to very poor performance on the minority classes, even though they have their own very specific vocabulary (f1-micro 0.79, f1-macro 0.23).
I've seen advice in other threads that doesn't seem applicable to my case. Namely:
- Oversampling, undersampling, SMOTE: I'm not using OvA or OvO, but rather a multinomial logistic regression, mainly because I have too many classes for those approaches. But even if I could use them, it would mean changing the distribution of labels in the training set. In binary classification we can adjust the prediction threshold to compensate, but I can't see how that would work for multiclass.
- Changing the performance metric: I'm already doing this, but it doesn't change the fact that the minority classes perform poorly. This is less of a problem when we remove the "bad" documents, but when they are present the performance on these small classes gets worse, since the big "bad" cluster shares a bit of their vocabulary by chance.
- Weighting: Same problem as before.
- Boosting, decision trees: not good for this type of data. I'm using a tf-idf representation, not pretrained embeddings, because the vocabulary is very specific.
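For reference, a toy sketch of the setup described above: tf-idf features feeding a multinomial logistic regression. The documents and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up short documents with class-specific vocabularies, plus a
# diverse "bad" class, mimicking the data described above.
docs = ["order 42 shipped", "order 17 shipped", "refund issued today",
        "refund issued late", "asdf qwerty zxcv", "lorem ipsum dolor"]
labels = ["shipping", "shipping", "refund", "refund", "bad", "bad"]

clf = make_pipeline(
    TfidfVectorizer(),                  # short docs -> sparse tf-idf vectors
    LogisticRegression(max_iter=1000),  # lbfgs fits a multinomial model for multiclass targets
)
clf.fit(docs, labels)
print(clf.predict(["order 99 shipped"]))
```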
There is no real answer to your question, because it really depends on what you are trying to achieve, i.e. is your goal to get a very high classification accuracy, or is it rather data exploration?
If you are purely interested in the classification, you should ask yourself the following questions:
Do I expect the same class priors for new samples? If yes, any over- or under-sampling will lead to a bad model by definition, since you essentially train the model on a different distribution than the one it will be applied to.
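If the training and deployment priors do differ (e.g. because you resampled), the multiclass analogue of moving a binary decision threshold is to re-weight the predicted class probabilities by the ratio of deployment priors to training priors and renormalize. A minimal sketch with illustrative numbers:

```python
import numpy as np

def adjust_priors(proba, train_priors, target_priors):
    # Re-weight each class probability by target_prior / train_prior,
    # then renormalize each row so it sums to 1 again.
    adjusted = proba * (np.asarray(target_priors) / np.asarray(train_priors))
    return adjusted / adjusted.sum(axis=1, keepdims=True)

proba = np.array([[0.6, 0.3, 0.1]])  # output of a model trained on balanced classes
train_priors = [1/3, 1/3, 1/3]
target_priors = [0.8, 0.1, 0.1]      # estimated class frequencies in the real data
print(adjust_priors(proba, train_priors, target_priors))
```

This is the generalization of binary threshold adjustment: instead of one cutoff, every class gets a multiplicative correction before taking the argmax.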
What are the consequences of misclassifying a sample? In many cases, the cost of misclassifying a sample is not the same for all classes; e.g. falsely assigning a sample to the 'bad document' class might have less severe consequences than assigning it to other classes.
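Unequal costs can be acted on at decision time: pick the class minimizing expected misclassification cost rather than the most probable class. A sketch with a made-up cost matrix:

```python
import numpy as np

# cost[true_class, predicted_class]; classes here are ['bad', 'specific'].
cost = np.array([[0.0, 5.0],    # true 'bad' predicted 'specific': expensive
                 [1.0, 0.0]])   # true 'specific' predicted 'bad': cheap

proba = np.array([[0.3, 0.7]])  # P(bad), P(specific) for one sample

expected_cost = proba @ cost    # expected cost of each possible prediction
decision = expected_cost.argmin(axis=1)
print(decision)                 # picks 'bad' (index 0) despite P(specific)=0.7
```

Note that the decision overrides the argmax of the probabilities whenever the cost asymmetry is large enough.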
Generally, a model will always try to minimize the loss, and it doesn't care how this is achieved. In a balanced context, this is done solely by learning correlations between predictors and the response; in cases of class imbalance, however, the model will also learn the prior distribution, which is independent of the predictors. This is not a misbehavior of the model if the actual distribution has these priors! (In this context I want to link a very good answer by Stephan Kolassa about the general issues when evaluating models based on accuracy.)
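A quick illustration of how much loss the prior alone can absorb: with a 90/10 split, always predicting the majority class already reaches 90% accuracy without looking at a single predictor.

```python
import numpy as np

# With a 90/10 class split, the trivial majority-class predictor is
# right 90% of the time -- this is the "prior" a model can learn for free.
y = np.array([0] * 90 + [1] * 10)
majority_baseline = (y == 0).mean()
print(majority_baseline)  # 0.9
```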
If you are less interested in the actual classification and more in questions such as 'what are the main predictors for the response?', 'do predictors interact?' or 'how big is the deterministic component / the learnability of this problem?', it can make sense to balance classes so that the model doesn't learn the priors but rather the associations between predictors and response, since those can be masked by class imbalance, especially if you deal with sparse data. However, keep in mind that the resulting model is unfit for classifying data following the original distribution.
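Such balancing doesn't have to be done by physically resampling; reweighting each class inversely to its frequency in the loss has a similar effect. A sketch using scikit-learn's 'balanced' weighting (which uses n_samples / (n_classes * n_samples_in_class)); the same caveat applies, i.e. the reweighted model is unfit for the original distribution:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative 90/10 imbalanced labels.
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights each class inversely to its frequency.
w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(w)  # the minority class gets 9x the weight of the majority class
```

The same option can be passed directly to many estimators, e.g. `LogisticRegression(class_weight="balanced")`.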