Solved – Probability cut-off value for Logistic Regression

I'm doing a project predicting default risk for SBA loans using R. My data has 187,986 positives and only 17,796 negatives (9.46% of the number of positives), so it is clearly an imbalanced dataset. That's why, when I check the accuracy of my model using a probability cutoff value of 0.5, the result is far too low, as shown below:

> mytestset$LoanStatus_PIF <- ifelse(mytestset$LoanStatus_PIF == "PIF", 1, 0)
> pred <- predict(final, newdata = mytestset, type = "response")
> y_pred_num <- ifelse(pred > 0.5, 1, 0)
> mean(y_pred_num == mytestset$LoanStatus_PIF)

Result:
[1] 0.009682553

When I change the cutoff value to 0.998 or higher, the result is much more impressive:

> y_pred_num <- ifelse(pred > 0.998, 1, 0)
> mean(y_pred_num == mytestset$LoanStatus_PIF)

Result:
[1] 0.9997752

Questions:

1. Will my model be rejected because I manually chose the cutoff value?
2. Does 0.999 make sense as a cutoff value, given that the common cutoff is usually 0.5?

Thanks!

Thanks for your question. I think that in order for us to provide you with a good answer, some additional information might be needed. I'll do my best to address what I see as the main issue here, and you can let me know if I'm missing the point of your question.

As you say, you have an imbalanced dataset. Whenever this is the case, looking simply at "accuracy" (the number of correctly classified observations among all observations, i.e., (true positives + true negatives) / total) can be very misleading.

To illustrate this in a situation such as yours, where one class is relatively rare, consider the most naive model possible: one that simply assigns the label "positive" to every observation, regardless of probability (akin to taking a cutoff value of 0). This model would have an accuracy of 0.9135! But accuracy hides what is really going on because of the class imbalance: in reality, you have an accuracy of 0% on the 17,796 "negatives" and an accuracy of 100% on the far-more-frequent 187,986 "positives." What happened is that your model outsmarted you: after all, why should it make the effort to discriminate between "positives" and "negatives" when it can be correct more than 90% of the time without doing so?
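To make the arithmetic explicit, here is a minimal sketch, using only the class counts from your post, of the accuracy such an all-"positive" classifier achieves:

# Class counts reported in the question
positives <- 187986
negatives <- 17796

# Accuracy of a classifier that labels every observation "positive"
positives / (positives + negatives)   # ~0.9135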

@Henry suggested correctly that you can force your model to care about discrimination by assigning "costs" to wrong decisions. However, for the more immediate concern of describing your model's performance, you need to disregard accuracy, as accuracy in this context is too misleading to be used (alone) to choose a "cutoff" value.
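To sketch what a cost-based choice could look like (my own illustration with made-up costs, not anything from your data): if a false positive costs c_FP and a false negative costs c_FN, the expected-cost-minimizing rule is to flag an observation as positive whenever its predicted probability exceeds c_FP / (c_FP + c_FN). For example:

# Hypothetical costs, purely for illustration
cost_FP <- 1    # cost of wrongly flagging a true "negative" as positive
cost_FN <- 10   # cost of missing a true "positive"

# Expected-cost-minimizing cutoff under these assumed costs
cutoff <- cost_FP / (cost_FP + cost_FN)   # 1/11, about 0.09
y_pred_num <- ifelse(pred > cutoff, 1, 0)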

Other performance metrics that can be computed from your data might be of more use to you: for example, the F1 score is a composite metric of precision and recall, which tends to be far more robust to class imbalances.
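For reference, the F1 score is the harmonic mean of precision and recall; here is a minimal sketch of how it is computed from confusion-matrix counts (the counts below are placeholders, not your data):

# Placeholder confusion-matrix counts, for illustration only
TP <- 100; FP <- 20; FN <- 30

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
F1        <- 2 * precision * recall / (precision + recall)
F1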

I would suggest this: if you are truly just looking for a cutoff value, create a vector of possible cutoffs, such as:

test <- data.frame(prob = seq(1/1000, 1, length.out = 1000), val = NA) 

Then, loop through each cutoff value, compute the metric of your choice (such as F1), and determine where it reaches its optimum. This approach is far from complete, and it may still be rejected if you can't properly defend it, but it is far less vulnerable than what you've proposed:

### Here's a quickly written function you can use to compute various performance metrics
ConfusionStats <- function(model, threshold) {
  DF <- model.frame(model)
  DF$prob <- predict(model, type = "response")
  DF$flag <- ifelse(DF$prob > threshold, 1, 0)

  # Name of the response variable (left-hand side of the model formula)
  response <- as.character(model$formula)[2]

  # Confusion-matrix counts (response assumed to be coded 0/1)
  TP <- sum(DF[[response]] == 1 & DF$flag == 1)
  FP <- sum(DF[[response]] == 0 & DF$flag == 1)
  TN <- sum(DF[[response]] == 0 & DF$flag == 0)
  FN <- sum(DF[[response]] == 1 & DF$flag == 0)

  Precision <- TP / (TP + FP)
  Recall    <- TP / (TP + FN)

  list(TP = TP, FP = FP, TN = TN, FN = FN,
       Accuracy    = (TP + TN) / (TP + FP + TN + FN),
       PPV         = Precision,   # positive predictive value (same as precision)
       Precision   = Precision,
       Recall      = Recall,
       Sensitivity = Recall,      # same as recall
       F1          = 2 * Precision * Recall / (Precision + Recall))
}

### And here's where you use the function, do the looping, and visualize the curve
for (i in 1:1000) {
  f1 <- ConfusionStats(model = final, threshold = test$prob[i])$F1
  test$val[i] <- ifelse(is.nan(f1), NA, f1)
}
plot(test)
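If it helps, once the loop has filled in test$val you can read the F1-maximizing cutoff straight off the grid (a small follow-up under the same assumptions as the sketch above):

# Cutoff on the grid with the highest F1 (NAs are ignored by which.max)
best_cutoff <- test$prob[which.max(test$val)]
best_cutoff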
