Re-scaling a confusion matrix after down-sampling one class

Let's say I have a large, unbalanced binary classification problem (in reality, nrow is more like 500k and ncol is more like 500):

set.seed(42)
nrow <- 10000
ncol <- 50
X <- matrix(rnorm(nrow*ncol), ncol=ncol)
Y <- X %*% rnorm(ncol(X)) * sample(0:1, ncol(X), replace=TRUE) + rnorm(nrow(X))
Y <- Y-20
Y <- exp(Y)/(1+exp(Y))
Y <- round(Y, 0)
> sum(Y==1)/length(Y)
[1] 0.0027

Before modeling, I down-sampled the negative class. I don't have a strong theoretical justification for doing this, but it makes my models fit a lot faster, and they seem to be better too.

keep <- which(Y==1)
sample <- sample(which(Y==0), length(keep))
Xfull <- X
Yfull <- Y
X <- X[c(keep, sample),]
Y <- Y[c(keep, sample),]
> sum(Y==1)/length(Y)
[1] 0.5

Fitting a model to the down-sampled dataset is pretty quick:

library(caret)
Y <- factor(paste('X', Y, sep=''))
model <- train(X, Y, method='glmnet',
               tuneGrid=expand.grid(.alpha=0:1, .lambda=0:30/10),
               trControl=trainControl(
                 method='cv',
                 summaryFunction=twoClassSummary,
                 classProbs=TRUE))
plot(model)

And I can use the cross-validation folds to estimate some statistics about the model's predictive ability:

> max(model$results$ROC)
[1] 0.9777778

> confusionMatrix(model)
Cross-Validated (10 fold) Confusion Matrix

(entries are percentages of table totals)

          Reference
Prediction   X0   X1
        X0 41.0  3.3
        X1  9.0 46.7

However, I would like to estimate these statistics on the FULL dataset, preferably without cross-validating my model on the full dataset, which would be extremely slow.

I was thinking of doing a naive re-scaling of the confusion matrix, like this:

scaling_factor <- 0.5/0.0027
CM <- confusionMatrix(model)$table * nrow(X)
CM[,1] <- CM[,1]*scaling_factor
> round(CM/sum(CM)*100, 2)
          Reference
Prediction    X0    X1
        X0 81.56  0.04
        X1 17.90  0.50

Does this seem like a reasonable calculation? Is there a similar method I could use to re-scale AUC? Or do I expect AUC to stay the same?

/edit: in response to B_Miner. I am fairly certain that evaluating the down-sampled model on the full dataset will overestimate its performance. It's easy to see why if we fit a random forest instead of a glmnet:

model <- train(X, Y, method='rf',
               trControl=trainControl(
                 method='cv',
                 summaryFunction=twoClassSummary,
                 classProbs=TRUE))

And predict this model on the full dataset:

pred_full <- predict(model, Xfull, type='raw')
> table(pred_full, Yfull)
         Yfull
pred_full    0    1
       X0 8531    0
       X1 1442   27

Because every single positive instance was used to train the model, the model can perfectly predict these instances, even on the full dataset.

/edit2: To clarify: I understand that the down-sampled model is biased. However, I suspect that the model's bias is predictable and consistent, and I'm looking for a theoretical way to correct for it, under the assumption that the removed negative observations come from the same distribution as the negative observations in the training set.

In response to the comments, here is the general answer for how to adjust the probabilities returned by any predictive model that has been built on a stratified / oversampled data set. Since you artificially increased the density of one class, you need to adjust the predicted probability that an observation belongs to that class back to the real-world space.

Note that this only works if the value returned is indeed a probability of class membership and is well calibrated in the oversampled space. If, for example, you are returning confidence values from an SVM, this rescaling does not work; you first need to calibrate the scores (e.g., with Platt scaling).
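For reference, Platt scaling amounts to fitting a logistic regression of the true labels on the raw scores. A minimal sketch in R, using simulated scores as stand-ins since the SVM itself isn't shown here:

```r
# Minimal Platt-scaling sketch: map raw classifier scores to calibrated
# probabilities by regressing the labels on the scores with a logit link.
# (Illustrative only; 'labels' and 'scores' are simulated stand-ins.)
set.seed(1)
labels <- rbinom(1000, 1, 0.5)
scores <- labels + rnorm(1000)   # uncalibrated decision values

platt <- glm(labels ~ scores, family = binomial)
calibrated <- predict(platt, type = "response")  # probabilities in [0, 1]
```

In practice you should fit the calibration model on a held-out set, not on the same data used to train the classifier.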

If your model's predicted probabilities of class membership are close to the actual probabilities on the oversampled data (I normally decile the data and compare actual versus predicted rates within each decile), then you can adjust them as follows. (I forget the original source, but the formula follows from Bayes' theorem and is used by SAS Enterprise Miner, for example.) Here, the original fraction is the proportion of 1s in the full data, the oversampled fraction is the proportion of 1s in the oversampled training set, and the scoring result p is the probability returned by the model:

adjusted p = (p * original fraction / oversampled fraction) /
             (p * original fraction / oversampled fraction + (1 - p) * (1 - original fraction) / (1 - oversampled fraction))
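This adjustment can be implemented as a small R function (a sketch; the function and argument names are my own, with the two fractions defined as above):

```r
# Adjust probabilities predicted on an oversampled data set back to the
# original population, via Bayes' theorem. 'p' is the model's predicted
# probability of class 1 in the oversampled space.
adjust_prob <- function(p, original_fraction, oversampled_fraction) {
  num <- p * original_fraction / oversampled_fraction
  den <- num + (1 - p) * (1 - original_fraction) / (1 - oversampled_fraction)
  num / den
}

# e.g. a predicted probability of 0.5 on the 50/50 down-sampled data
# maps back to the original 0.27% base rate:
adjust_prob(0.5, original_fraction = 0.0027, oversampled_fraction = 0.5)
# [1] 0.0027
```

Note the adjustment is monotone, so it changes the probabilities (and hence the confusion matrix at a fixed cutoff) but not the ranking of observations.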
