Solved – How to optimize a classification model when you only care about the top 5% of the ROC curve

Imagine a real-world scenario where you are only allowed to guess on between 0 and 5% of the total population. You have to say, "I think these 5% of people have trait A," and you aren't allowed to guess more than that. The other thing is, only 3–5% of the people actually have trait A, so it's not necessarily an easy trait to pick up on.

I guess I don't care about the entire AUC, I only care about the AUC between 0.95 and 1.00.

As an aside, most of the modeling I do is in R using caret. If there is a simple setting to adjust for this metric, that would be much appreciated:

model <- train(y = y, x = x,
               metric = "ROC",
               method = "rpart",
               trControl = fiveFoldsClass)  # fiveFoldsClass: a trainControl object set up for 5-fold CV

The general topic of binary classification with strongly imbalanced classes has been covered to a certain extent in the thread of the same name. In short: caret does allow for more imbalance-appropriate metrics such as Cohen's kappa or the precision-recall AUC; the PR AUC is relatively recent, and you can access it through the prSummary summary function. You can also try resampling approaches, where you rebalance the sample during estimation so that the minority class's features become more prominent.
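
For concreteness, here is a minimal sketch of both options in caret (the x and y objects are the same placeholders as in your snippet, and the particular settings such as up-sampling are only illustrative):

library(caret)

# 5-fold CV that keeps class probabilities and scores models by PR AUC.
# prSummary needs classProbs = TRUE and the MLmetrics package; it reports
# the PR AUC under the column name "AUC". sampling = "up" is an optional
# rebalancing step applied inside each resample.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = prSummary,
                     sampling = "up")

model <- train(x = x, y = y,
               method = "rpart",
               metric = "AUC",        # the PR AUC column produced by prSummary
               trControl = ctrl)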

Having said the above, you seem to have a particular constraint on the total number of positives $N$ you can predict. I can think of two immediate work-arounds. Both rely on the idea that you are using a probabilistic classifier. Simply put, a probabilistic classifier is a classification routine that can output a measure of belief about its prediction in the form of a number in $[0,1]$ that we can interpret as a probability. Elastic nets, random forests and various ensemble classifiers usually offer this out of the box. SVMs usually do not provide out-of-the-box probabilities, but you can get them if you are willing to accept some approximations. Anyway, back to the work-arounds:

  1. Use a custom metric. Instead of evaluating the area below the whole PR curve, we focus on the area that guarantees a minimum number of points. These are generally known as partial AUC metrics, and they require us to define a custom performance metric; check caret's trainControl summaryFunction argument for more on this (a sketch is given after this list). Let me stress that you do not necessarily have to look at an AUC. Given that we can estimate probabilities at each step of our model-training procedure, we can apply a thresholding step within the estimation procedure right before evaluating our performance metric. Notice that in the case where we "fix $N$", using recall (sensitivity) as the metric would be fine, because it immediately controls for the fact that we want $N$ points. (Actually, if the number of predicted positives equals the number of actual positives, recall and precision are equal, since the number of false negatives equals the number of false positives.)

  2. Threshold the final output. Given that one can estimate the probability of an item belonging to a particular class, we can simply pick the $N$ items with the highest predicted probabilities for the class of interest. This is very easy to implement, as we essentially apply a threshold right before reporting our findings; we can estimate models and evaluate them with our favourite performance metrics without any real change to our workflow (a sketch is also given after this list). It is a simplistic approach, but it is the easiest way to satisfy the constraint given. If we use this approach, it is probably more relevant to train with an AUC-based performance metric in the first place. That is because using something like accuracy, recall, etc. would imply a particular probability threshold $p$ (usually $0.5$) when computing the metrics for model training, and we do not want that, because we will not be calibrating that $p$ under this approach.
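
For work-around 1, a rough sketch of a custom summary function is below. The 5% budget, the rpart model and the assumption that the class of interest is the first factor level (caret's usual convention) are all illustrative choices, not requirements:

library(caret)

# Recall obtained when we flag only the top 5% of cases in each resample,
# ranked by the predicted probability of the positive class.
topFracSummary <- function(data, lev = NULL, model = NULL) {
  positive <- lev[1]                                   # class of interest
  n_keep   <- max(1, floor(0.05 * nrow(data)))         # the 5% budget
  ranked   <- data[order(-data[[positive]]), ]         # sort by predicted prob
  flagged  <- ranked[seq_len(n_keep), ]
  recall_at_top <- sum(flagged$obs == positive) / sum(data$obs == positive)
  c(RecallTop5 = recall_at_top)
}

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = topFracSummary)

model <- train(x = x, y = y,
               method = "rpart",
               metric = "RecallTop5",   # maximise recall within the 5% budget
               trControl = ctrl)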
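For work-around 2, taking the top $N$ of the final predictions is only a few lines once class probabilities are available. The object names (new_x) and the class labels ("A", "not_A") are placeholders for your own data:

# Predicted probability of the class of interest on the scoring set
probs <- predict(model, newdata = new_x, type = "prob")[, "A"]

N       <- floor(0.05 * length(probs))              # the fixed budget of positive calls
top_idx <- order(probs, decreasing = TRUE)[seq_len(N)]

predicted_class <- rep("not_A", length(probs))      # default: not flagged
predicted_class[top_idx] <- "A"                     # flag only the top N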

A very important caveat: we need a well-calibrated probabilistic classifier to use this approach; i.e. we need good agreement between the predicted class probabilities and the observed class rates (check caret's calibration function on this). Otherwise our insights will be completely off when it comes to discriminating between items. As a final suggestion, I recommend looking at lift curves; they show how quickly you can find a given number of positive examples as you work down the ranked predictions. Given the restriction imposed, lift charts will probably be very informative, and you may well want to present them when reporting your findings.
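
A minimal sketch of those diagnostic plots with caret, assuming held-out objects x_test and y_test and the same placeholder class label "A" as above:

library(caret)
library(lattice)

# Held-out observations and the predicted probability of class "A"
eval_df <- data.frame(obs   = y_test,
                      probA = predict(model, newdata = x_test, type = "prob")[, "A"])

# Agreement between predicted probabilities and observed event rates
xyplot(calibration(obs ~ probA, data = eval_df, class = "A"))

# How quickly the positives are found as we move down the ranked list
xyplot(lift(obs ~ probA, data = eval_df, class = "A"))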
