I am developing a text classification problem, in which at some time points, say at the end of each week, I receive a batch of feedback from users about correctly and wrongly classified inputs. I am trying different strategies for incorporating this feedback to improve the algorithm.
The base-line for me is the naive method of retraining a complete new model on the entire available data (including the new coming batch).
I have tried different strategies:
Assigning more weights to the miss-classified observations, and also more recent observations. In order to assign more weights I just copy the points with more weights to training data, e.g., if $x_1$ and $x_2$ have weights 1 and 2, respectively, the training data will be $(x_1, x_2, x_2)$.
Training a classifier on each coming batch and (with more weights on miss-classified observations) and then consider an ensemble of these simple classifiers, something like hard or soft voting for final decision.
Surprisingly the naive base-line eventually bits both methods. So the question is that why is this happening?
Here are the results. x-axis is the number of the coming batch, and the y-axis is the accuracy of the model on the new coming batch.
classification model is just a simple linear classifier as mean for comparison of methods for incorporating feedback.
Machine learning algorithms are, by definition, methods for minimizing the loss function. A model is therefore locally optimized for minimizing the loss function. It follows that, generally speaking, if model M1 is trained on loss function L1, and model M2 is trained on loss function L2, M2 will have a worse L1 score than M1 does, and M1 will have a worse L2 score than M2 does.
Your loss function does not weight misclassified inputs higher, so models trained with those inputs weighted higher will do more poorly on your loss function. If you were instead to measure your Strategy1 models on your weighted loss function, then Strategy1 would do better than the baseline. It's like if you bred quarter-horses (horses optimized for running quarter-mile races), then ran one of them in a 10-mile race. Of course it's not going to do as well as a horse breed that's been optimized for 10-mile races.
Now, something that can improve accuracy is, given an ensemble of classifier, give more weight to the classifiers that do better at the inputs misclassified by the other classifiers. This is the concept behind gradient boosting.
Something to look at is what the nature of the feedback is. For each input, can you tell from the feedback whether it was correctly classified? If not, then are there any patterns as to what inputs you get feedback on? If you are more likely to get feedback on misclassified inputs, then you might actually get a better model by weighting correctly classified inputs higher, because for every correctly classified input you see, there are several more that you didn't see.
- Solved – How total loss is manipulated in mini batch gradient descent as the loss function is calculated and minimized for mini batches
- Solved – How to RNN handle variable sized inputs?
- Solved – Strange batch loss in keras
- Solved – How to set mini-batch size in SGD in keras
- Solved – How to read a 10×10 confusion matrix