# Re-scaling a confusion matrix after down-sampling one class

Let's say I have a large, unbalanced binary classification problem (in reality nrow is more like 500k, and ncol is more like 500):

```r
set.seed(42)
nrow <- 10000
ncol <- 50
X <- matrix(rnorm(nrow*ncol), ncol=ncol)
Y <- X %*% rnorm(ncol(X)) * sample(0:1, ncol(X), replace=TRUE) + rnorm(nrow(X))
Y <- Y - 20
Y <- exp(Y)/(1+exp(Y))
Y <- round(Y, 0)

> sum(Y==1)/length(Y)
[1] 0.0027
```

Before modeling, I down-sampled the negative class. I don't have a strong theoretical justification for doing this, but it makes my models fit a lot faster, and they seem to be better too.

```r
keep <- which(Y==1)
sample <- sample(which(Y==0), length(keep))
Xfull <- X
Yfull <- Y
X <- X[c(keep, sample),]
Y <- Y[c(keep, sample),]

> sum(Y==1)/length(Y)
[1] 0.5
```

Fitting a model to the down-sampled dataset is pretty quick:

```r
library(caret)
Y <- factor(paste('X', Y, sep=''))
X <- as.data.frame(X)
model <- train(X, Y, method='glmnet',
               tuneGrid=expand.grid(.alpha=0:1, .lambda=0:30/10),
               trControl=trainControl(
                 method='cv',
                 summaryFunction=twoClassSummary,
                 classProbs=TRUE))
plot(model)
```

And I can use the cross-validation folds to estimate some statistics about the model's predictive ability:

```r
> max(model$results$ROC)
[1] 0.9777778

> confusionMatrix(model)
Cross-Validated (10 fold) Confusion Matrix

(entries are percentages of table totals)

          Reference
Prediction   X0   X1
        X0 41.0  3.3
        X1  9.0 46.7
```

However, I would like to estimate these statistics on the FULL dataset, preferably without cross-validating my model on the full dataset, which would be extremely slow.

I was thinking of doing a naive re-scaling of the confusion matrix, like this:

```r
scaling_factor <- 0.5/0.0027
CM <- confusionMatrix(model)$table * nrow(X)
CM[,1] <- CM[,1]*scaling_factor

> round(CM/sum(CM)*100, 2)
          Reference
Prediction    X0    X1
        X0 81.56  0.04
        X1 17.90  0.50
```
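One way to make this kind of column-wise re-weighting explicit is to rescale each reference column of the cross-validated confusion matrix so its total matches the class counts in the full data. This is only a sketch of the naive idea, not a validated correction; `rescale_cm` and `full_counts` are illustrative names, and the numbers below come from the confusion matrix shown above:

```r
# Re-scale a confusion matrix's reference columns so that each column's
# total matches the corresponding class count in the full data.
# full_counts is assumed to be something like table(Yfull).
rescale_cm <- function(cm_counts, full_counts) {
  scaled <- sweep(cm_counts, 2, colSums(cm_counts), "/")  # column proportions
  sweep(scaled, 2, full_counts, "*")                      # scale to full-data counts
}

cm <- matrix(c(41, 9, 3.3, 46.7), nrow = 2,
             dimnames = list(Prediction = c("X0", "X1"),
                             Reference  = c("X0", "X1")))
full <- c(X0 = 9973, X1 = 27)  # class counts in the full dataset
round(rescale_cm(cm, full), 1)
```

Note this assumes the column proportions (i.e., the per-class error rates) estimated on the down-sampled data carry over to the full data, which is exactly the assumption in question.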

Does this seem like a reasonable calculation? Is there a similar method I could use to re-scale AUC? Or do I expect AUC to stay the same?

/edit: In response to B_Miner: I am fairly certain that evaluating the down-sampled model on the full dataset will overestimate its performance. It's easy to see why if we fit a random forest instead of a glmnet:

```r
model <- train(X, Y, method='rf',
               trControl=trainControl(
                 method='cv',
                 summaryFunction=twoClassSummary,
                 classProbs=TRUE))
```

And predict this model on the full dataset:

```r
pred_full <- predict(model, Xfull, type='raw')

> table(pred_full, Yfull)
         Yfull
pred_full    0    1
       X0 8531    0
       X1 1442   27
```

Because every single positive instance was used to train the model, the model can perfectly predict these instances, even on the full dataset.
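One way to avoid this optimism, at the cost of a smaller training set, is to split off a held-out test set from the full data before down-sampling, so the held-out positives never touch the training data. A rough sketch with illustrative variable names and an illustrative positive rate (not the data from the question):

```r
set.seed(42)
n <- 10000
y <- rbinom(n, 1, 0.05)  # imbalanced labels, illustrative 5% positive rate

# 1. Split off a test set from the FULL data first
test_idx  <- sample(n, n / 5)
train_idx <- setdiff(seq_len(n), test_idx)

# 2. Down-sample negatives only within the training portion
pos <- train_idx[y[train_idx] == 1]
neg <- sample(train_idx[y[train_idx] == 0], length(pos))
downsampled <- c(pos, neg)

# The test set keeps the original class ratio and shares no rows
# with the down-sampled training data
stopifnot(length(intersect(downsampled, test_idx)) == 0)
```

Metrics computed on `test_idx` then reflect the full-data class balance without any re-scaling.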

/edit2: To clarify: I understand that the down-sampled model is biased. However, I suspect that the model's bias is predictable and consistent. I'm looking for a theoretical way to correct for this bias, under the assumption that the removed negative observations come from the same distribution as the negative observations in the training set.
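For logistic-family models there is a standard correction for exactly this sampling scheme: down-sampling the negative class at rate β shifts only the model's intercept (the case-control / prior-correction result, as in King & Zeng's work on rare-events logistic regression), so population-scale probabilities can be recovered by multiplying the predicted odds by β. A sketch, where `beta` is the fraction of full-data negatives that survived the down-sampling (here roughly 0.0027/(1-0.0027), since the negatives were sampled down to match the 27 positives):

```r
# Correct probabilities predicted under the down-sampled prior back to
# the population scale.
# beta = (number of negatives kept) / (number of negatives in full data)
correct_probs <- function(p_sample, beta) {
  odds <- p_sample / (1 - p_sample)  # odds under the down-sampled prior
  pop_odds <- beta * odds            # undo the case-control intercept shift
  pop_odds / (1 + pop_odds)
}

beta <- 0.0027 / (1 - 0.0027)
correct_probs(0.5, beta)  # 0.0027: a sample "coin flip" is rare in the population
```

Applying this correction to the cross-validated probabilities, then re-thresholding, would be one principled route to a full-data confusion matrix; it does not by itself fix the optimism from having trained on every positive.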
