I'm trying to perform some classification analyses with a relatively small dataset (201 observations, 32 predictors). There are 8 classes in my data with unequal sample sizes ranging from 10 in the least popular class to 43 in the most popular. Using CART and RF the classification performance is quite poor at ~50% for CART and ~65% for RF.
Out of interest I sampled with replacement 383 samples from the original 201 samples. The sample sizes for each class are still very unbalanced (20 in the least popular class, 77 in the most popular). I tested the bootstrapped dataset with CART and RF, and the performance of both classifiers is much better: ~80% for CART and ~90% for RF. I've got two questions related to this:
1) Why does sampling with replacement increase the accuracy of predictive models when no "new" data is created, i.e. all the extra samples come from the same original dataset and the classes are still very unbalanced?
2) Is this a legitimate way to improve model performance if explained and compared to the original dataset?
There are a lot of questions on here about bootstrapping data, but I can't seem to find any related to classification accuracy in predictive models.
Here is an example using the iris dataset in R where the original dataset has an error rate of 4% compared to 0.33% for the bootstrap dataset.
library(randomForest)
data(iris)

set.seed(514)
iris.boot <- iris[sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE), ]

iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500)
iris.rf

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06

irisboot.rf <- randomForest(Species ~ ., data = iris.boot, ntree = 500)
irisboot.rf

Call:
 randomForest(formula = Species ~ ., data = iris.boot, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 0.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         95          0         0 0.000000000
versicolor      0        107         1 0.009259259
virginica       0          0        97 0.000000000
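As a side check (my own addition, not part of the original question), one can quantify how heavily the resampled rows are duplicated. With 300 draws with replacement from 150 rows, most rows appear more than once, so a case that is held out (or out-of-bag for a given tree) very often has an exact copy in the training data, which makes error estimates look optimistic:

data(iris)
set.seed(514)
idx <- sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE)

# Proportion of the 300 resampled rows that have at least one exact duplicate
mean(duplicated(idx) | duplicated(idx, fromLast = TRUE))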
Response to comments
In my actual work I'm using 10 x 10 CV to tune models, to hopefully minimise overfitting and reduce any optimistic bias in my results. I'm also assessing model performance using kappa and % agreement. Is this a better approach than using classification error by itself?
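For reference, here is a minimal sketch of that kind of tuning loop (my own illustration; it assumes the caret package, which is not mentioned in the original post), using 10 repeats of 10-fold CV with Cohen's kappa as the selection metric:

library(caret)
library(randomForest)

data(iris)
set.seed(514)

# 10 repeats of 10-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Tune mtry for a random forest, selecting the candidate with the best kappa
rf.cv <- train(Species ~ ., data = iris, method = "rf",
               metric = "Kappa", trControl = ctrl, ntree = 500)
rf.cv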
Also, you have said that using the bootstrap the way I have will result in model overfitting, which I believe should show up as poor predictive accuracy on new samples not seen by the model during fitting. However, using the iris dataset again as an example, I get a kappa of 1 and 100% agreement on "new" samples, i.e. those that were not included in the bootstrap dataset:
library(randomForest)
library(irr)
data(iris)

iris$ObsNumber <- 1:150
set.seed(514)
iris.boot <- iris[sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE), ]
validation.set <- subset(iris, !(iris$ObsNumber %in% iris.boot$ObsNumber))
iris$ObsNumber <- NULL
iris.boot$ObsNumber <- NULL
validation.set$ObsNumber <- NULL

iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500)
bootiris.rf <- randomForest(Species ~ ., data = iris.boot, ntree = 500)

predictions <- predict(bootiris.rf, validation.set)
kappa2(data.frame(predictions, validation.set$Species))

 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 19
   Raters = 2
    Kappa = 1

        z = 6.12
  p-value = 9.23e-10

agree(data.frame(predictions, validation.set$Species))

 Percentage agreement (Tolerance=0)

 Subjects = 19
   Raters = 2
  %-agree = 100
Also, is there an easy way to implement the Efron-Gong bootstrap you mention with randomForest?
Best Answer
There are several issues:
- The sample size is far too low to reliably do what you are attempting
- Classification error is an improper scoring rule that is optimized by an incorrect model with incorrect features and incorrect weights
- You are using the bootstrap incorrectly. Because the bootstrap samples with replacement, it duplicates observations, and those duplications increase the amount of overfitting.
- With the more appropriate Efron-Gong optimism bootstrap, which is used to estimate the drop-off in predictive performance and so obtain overfitting-corrected estimates of predictive accuracy, the philosophy is to estimate the difference between the predictive accuracy of the fitted model evaluated on its own training data and the true, unknown predictive accuracy on new data. The bootstrap estimates this difference (the amount of overfitting) as the gap between super-overfitting (evaluating the model fitted on a bootstrap sample on that same bootstrap sample) and regular overfitting (evaluating the model fitted on the bootstrap sample on the original sample). A rough sketch of the procedure follows this list.
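A hand-rolled sketch of the optimism bootstrap for a random forest (my own illustration, not code from the answer; it uses the iris data and overall accuracy purely to keep the example short):

library(randomForest)

data(iris)
set.seed(514)

acc <- function(fit, dat) mean(predict(fit, dat) == dat$Species)

# Apparent accuracy: model fitted and evaluated on the same (original) data
full.fit <- randomForest(Species ~ ., data = iris, ntree = 500)
apparent <- acc(full.fit, iris)

B <- 100
optimism <- replicate(B, {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  fit  <- randomForest(Species ~ ., data = boot, ntree = 500)
  # "super-overfitting" minus "regular overfitting"
  acc(fit, boot) - acc(fit, iris)
})

# Overfitting-corrected estimate of predictive accuracy
apparent - mean(optimism)

The same resampling loop can be used with a proper scoring rule (e.g. the Brier score computed from predicted class probabilities) in place of raw accuracy, which also addresses the scoring-rule point above.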
Similar Posts:
- Solved – why bootstrap result in overfitting for randomForest prediction
- Solved – R randomForest has classification error of zero for different counts of a given class
- Solved – Interpreting output of importance of a random forest object in R
- Solved – Why does randomForest has higher test AUC than train AUC? Is this possible?