I am using a random forest for a two-class classification problem, but for my task I use the probability of class "1" returned by the model rather than the predicted label. I get an AUC of about 70%.
Then I compare the probability with the real-world value and measure the difference (the residual). Next, I build a regression random forest to predict that residual from the same features. This seems like a weird idea, but I tried it. I then correct the probabilities returned by the first model with the output of the second model, and this improved performance on the "test set" significantly. The second model explains 85% of the variability.
What does this mean? Why is the first model not accurate enough? Even though the same features are used in the second model, it improves performance.
Somehow, the model that predicts the residual of the classification model performs better than the classification model itself, and both models use the same features.
It means you're over-fitting the training data without assessing the generalization error using e.g. cross-validation.
You can avoid using cross-validation when using random forests because of the way they estimate the out-of-bag (OOB) error as they go. However, once you use those OOB prediction residuals as the inputs to the second random forest model in your pipeline, this is no longer true. To get an estimate of the generalization error, you need to think of this two-stage process as the components of a single new model that needs to be assessed via cross-validation. Build the whole two-stage model on a training sample, then assess its accuracy on a test set that played no part in fitting either stage, and I guarantee your results will not look as good.
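A minimal sketch of what assessing the two-stage process as one model could look like, assuming scikit-learn and synthetic data from `make_classification` standing in for the real data. The helper names `fit_two_stage` and `predict_two_stage` are hypothetical, and clipping the corrected probability to [0, 1] is an assumption about how the correction is applied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def fit_two_stage(X, y, seed=0):
    """Fit the classifier, then a regressor on its OOB residuals (hypothetical helper)."""
    clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=seed)
    clf.fit(X, y)
    # OOB probability of class 1 for each training row
    residual = y - clf.oob_decision_function_[:, 1]
    reg = RandomForestRegressor(n_estimators=100, random_state=seed)
    reg.fit(X, residual)
    return clf, reg


def predict_two_stage(clf, reg, X):
    """Stage-1 probability plus the predicted residual, clipped to [0, 1] (assumption)."""
    return np.clip(clf.predict_proba(X)[:, 1] + reg.predict(X), 0.0, 1.0)


# Synthetic, noisy two-class data (stand-in for the real problem)
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_stage1, auc_two_stage = [], []
for train_idx, test_idx in cv.split(X, y):
    # Both stages see only the training fold
    clf, reg = fit_two_stage(X[train_idx], y[train_idx])
    p1 = clf.predict_proba(X[test_idx])[:, 1]
    p2 = predict_two_stage(clf, reg, X[test_idx])
    auc_stage1.append(roc_auc_score(y[test_idx], p1))
    auc_two_stage.append(roc_auc_score(y[test_idx], p2))

print(f"stage 1 only: mean AUC = {np.mean(auc_stage1):.3f}")
print(f"two-stage:    mean AUC = {np.mean(auc_two_stage):.3f}")
```

The point of the loop is that the held-out fold is touched only for scoring, so any gain from the second stage that survives here is a gain you can expect on new data.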
Another way to look at this is to see what happens if, instead of stopping at two random forest models, you build a third model to predict the residuals of the second. You'll get an even better result. Chain enough of these together and you'll predict the test set perfectly.
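A rough sketch of the chaining argument, assuming scikit-learn, synthetic data, and that each extra stage is fit on the in-sample residuals of the stages before it (an assumption about the setup, not something stated in the question). It tracks the mean squared error on the rows the chain was fit on and on a hold-out split that none of the stages ever saw.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, noisy two-class data (stand-in for the real problem)
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1: classifier; use its probability of class 1 as the prediction
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)
pred_fit = clf.predict_proba(X_fit)[:, 1]
pred_hold = clf.predict_proba(X_hold)[:, 1]

fit_mse = [np.mean((y_fit - pred_fit) ** 2)]
hold_mse = [np.mean((y_hold - pred_hold) ** 2)]

# Stages 2..4: each regressor is fit on the current in-sample residuals
for stage in range(3):
    reg = RandomForestRegressor(n_estimators=100, random_state=stage)
    reg.fit(X_fit, y_fit - pred_fit)
    pred_fit = pred_fit + reg.predict(X_fit)
    pred_hold = pred_hold + reg.predict(X_hold)
    fit_mse.append(np.mean((y_fit - pred_fit) ** 2))
    hold_mse.append(np.mean((y_hold - pred_hold) ** 2))

for k, (a, b) in enumerate(zip(fit_mse, hold_mse)):
    print(f"stages={k + 1}  fit-set MSE={a:.4f}  hold-out MSE={b:.4f}")
```

Under these assumptions, the error on the rows used to compute the residuals keeps shrinking as stages are added, while the hold-out error shows what the chain is actually worth on unseen data.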