This may be an obvious/basic random forest question, but here goes..
Given the Iris dataset, we tried two different numbers of trees. Here are the results for 50. Notice in particular that setosa was ostensibly classified correctly for all 36 of its observations, i.e. its row of the confusion matrix has zeros everywhere off the diagonal:
```
fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)
fit

Call:
 randomForest(formula = f, data = iris_train, proximity = TRUE, ntree = 50)
               Type of random forest: classification
                     Number of trees: 50
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         36          0         0  0.00000000
versicolor      0         33         1  0.02941176
virginica       0          2        28  0.06666667
```
Now let us try an unreasonably small number of trees: five. Notice that setosa now appears in the confusion matrix only 32 times (vs. 36), yet its classification error is still zero:
```
fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=5)
print(fit$importance)
             MeanDecreaseGini
sepal_length         1.735648
sepal_width          1.939250
petal_length        28.977475
petal_width         33.199627
print(fit)

Call:
 randomForest(formula = f, data = iris_train, proximity = TRUE, ntree = 5)
               Type of random forest: classification
                     Number of trees: 5
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6.82%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         32          0         0  0.00000000
versicolor      0         29         1  0.03333333
virginica       0          5        21  0.19230769
```
I am missing something basic here: how can the number of instances assigned to a particular class vary, yet its classification error remain unaffected?
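(For reference, `f` and `iris_train` are not defined in the post. A minimal sketch along the following lines, assuming a `species` formula over renamed iris columns and a 100-row training split, should produce output of the same shape, though the exact counts depend on the split and seed.)

```r
library(randomForest)

# Hypothetical reconstruction of the objects the question assumes:
# rename the iris columns to match the post and hold out part of the data.
set.seed(1)
iris2 <- iris
names(iris2) <- c("sepal_length", "sepal_width",
                  "petal_length", "petal_width", "species")
iris_train <- iris2[sample(nrow(iris2), 100), ]
f <- species ~ .

fit <- randomForest(f, data = iris_train, proximity = TRUE, ntree = 50)
print(fit)
```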
Best Answer
Setosa is simply easy to separate. Look at petal length vs. petal width, for example: you can draw a box that contains every setosa and nothing else. RF is learning the shape of that box, which is why setosa is never misclassified. Conversely, for the other two classes no such box can be drawn: either it won't include all of a species, or it will also include some of another species' points. That is the source of error for the other two classes: many boxes must be drawn to progressively improve the purity of the resulting splits. Those boxes carry no information about the out-of-sample points, so some of them will inadvertently generalize poorly. Because setosa sits far from the other classes, a large number of alternative boxes all work, which mitigates this effect.
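As a quick illustration of that box (a sketch on the built-in `iris` data, which uses the column names `Petal.Length`/`Petal.Width` rather than the renamed ones above), a single threshold on petal length already separates setosa perfectly, while versicolor and virginica overlap no matter where the threshold is placed:

```r
# Every setosa has petal length below 2; every other observation is above 3,
# so one axis-aligned split isolates setosa completely.
setosa_pl <- iris$Petal.Length[iris$Species == "setosa"]
other_pl  <- iris$Petal.Length[iris$Species != "setosa"]
max(setosa_pl)                               # 1.9
min(other_pl)                                # 3.0
all(setosa_pl < 2.5) && all(other_pl > 2.5)  # TRUE

# versicolor and virginica overlap in petal length (and width),
# so any single box either misses some points or captures the wrong species.
with(iris, table(Species, Petal.Length > 4.8))
```

As for the drop from 36 to 32 setosa rows: with only five trees it is likely that some training rows were never out of bag (`fit$oob.times` counts how often each row was out of bag), so they receive no OOB prediction and are simply absent from the OOB confusion matrix; the setosa rows that do appear are still all classified correctly, hence the unchanged zero error.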