Solved – Is it necessary to do $k$-fold cross-validation for decision trees in random forests

Consider the following data set train:

    z   a   b   c
    0   1   40  185
    0   1   25  128
    0   0   32  100
    0   0   29  100
    1   1   30  107
    0   0   30  133
    1   1   38  132
    1   1   37  127
    1   0   30  184
    1   0   40  199
    1   1   26  185
    0   1   21  185
    0   0   21  134
    0   0   20  137
    1   1   22  135
    0   0   23  189
    1   0   32  109
    1   0   31  152
    1   0   38  130
    1   1   37  191
    0   1   39  168
    1   0   28  183
    0   1   26  171
    1   1   23  164
    0   1   32  111
    0   0   34  131
    1   0   30  121
    1   0   27  195
    1   1   29  117
    1   0   26  187
    1   0   34  183
    0   0   28  189
    0   1   34  150
    0   1   34  176
    0   1   24  140
    1   0   37  181
    0   1   36  109
    1   0   39  198
    0   0   32  164

where z is a binary variable with predictors a, b, c. Suppose there is some test set with the same variables as the train data set and we want to predict z. For decision trees, is it better to use the full train data set to construct the tree? What would the purpose of $4$-fold cross-validation be, for example?

In a random forest, is $k$-fold cross-validation necessary? I thought you could use OOB error?

I can recommend this article discussing good CV practice.

  • (A) When simply running one RF model: yes, OOB-CV is a fine estimate of your future prediction performance, given i.i.d. sampling. For many practical cases you have neither the time nor the need for anything more: a default RF model is simply good enough, and you will only start fiddling with the hyperparameters later, if ever. I would spend more time deciding which prediction performance metric (AUC, accuracy, recall, etc.) best answers my question. A little fiddling with hyperparameters (mtry, sample size) does not make your OOB-CV vastly over-optimistic. See the OOB sketch after this list.

  • (B1) When comparing across classifiers (SVM, logistic regression, RF, etc.): you need to use the same CV regime for every model, so you cannot rely on OOB-CV, which is only available for RF. Use, e.g., 20-times-repeated 10-fold CV, where all models are tested on the same folds (the same partitions). See the shared-folds sketch after this list.

  • (B2) When performing grid search and variable selection: to evaluate the predictive performance of each variant of your model, you would probably use OOB-CV or some other CV. To estimate the overall performance without bias, you need to wrap the whole model-selection process in an outer cross-validation, called nested CV. See the nested-CV sketch after this list.
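
As a minimal sketch of point (A), assuming the posted train data sits in a pandas DataFrame with columns z, a, b, c and that scikit-learn is used (the question names no library, and the file name below is hypothetical), the OOB estimate falls out of a default random forest with no extra CV loop:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file holding the posted train data: binary target z, predictors a, b, c.
train = pd.read_csv("train.csv")
X, y = train[["a", "b", "c"]], train["z"]

# Default-ish forest with OOB scoring enabled: each tree is evaluated on the rows
# left out of its bootstrap sample, giving a built-in CV-like performance estimate.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
```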

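For points (B1) and (B2), here is a hedged sketch of the same ideas in scikit-learn, reusing X and y from the sketch above; the classifier list, fold counts, and the mtry-like grid are illustrative choices, not taken from the original post. All models are scored on identical repeated folds, and the grid search is then wrapped in an outer CV so the reported score is not inflated by the selection step (nested CV):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_score)

# (B1) Compare classifiers on the *same* repeated folds so the scores are comparable.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
models = {
    "rf": RandomForestClassifier(random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(name, "mean AUC:", scores.mean())

# (B2) Nested CV: the grid search (inner CV) is itself evaluated by an outer CV,
# so the final estimate is not biased by having tuned on the same data.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"max_features": [1, 2, 3]},  # illustrative mtry-like grid
                      cv=inner, scoring="roc_auc")
nested_scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print("nested-CV AUC:", nested_scores.mean())
```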