Solved – How to prevent overfitting with regression using ranger (random forest)

I use caret to train the model (on the Boston dataset from the mlbench package).

Here is the code:

library(caret)
library(mlbench)
data(Boston)

set.seed(2)
ind <- sample(nrow(Boston), trunc(0.7 * nrow(Boston)))
train <- Boston[ind, ]
test <- Boston[-ind, ]

# Fit a ranger random forest using 5 x 5-fold CV:
model <- train(
  medv ~ ., train,
  method = "ranger",
  trControl = trainControl(
    method = "repeatedcv", number = 5,
    repeats = 5, verboseIter = FALSE
  )
)

When printing the model, it seems I have a good model:

Random Forest

354 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 282, 282, 285, 283, 284, 283, ...
Resampling results across tuning parameters:

  mtry  splitrule   RMSE      Rsquared   MAE
   2    variance    4.172443  0.8113023  2.702026
   2    extratrees  4.574969  0.7819608  2.946490
   7    variance    3.744418  0.8324785  2.475156
   7    extratrees  3.812538  0.8342013  2.478945
  13    variance    3.821406  0.8214275  2.517686
  13    extratrees  3.795269  0.8282988  2.465104

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 7, splitrule = variance and min.node.size = 5.

When plotting the model, I get:

[figure: cross-validated RMSE against the tuning parameters]
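
The plotting call is not shown in the question; presumably the figure comes from caret's default resampling-profile plot:

plot(model)  # cross-validated RMSE against mtry, one curve per splitrule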

However, when I calculate the RMSE on the training and test sets:

sqrt(mean((predict(model) - train$medv)^2))                # 1.487133
sqrt(mean((predict(model, newdata = test) - test$medv)^2)) # 2.648461

the in-sample RMSE (1.49) is far below the test RMSE (2.65), which looks like overfitting. I would like to know what I did wrong and what I have to do to improve the model. Thank you.

You didn't do anything wrong.

The relevant comparison is the test RMSE (2.6) versus the RMSE obtained from cross-validation (3.8). So your model does even better on the hold-out test data than cross-validation suggested. Possible reasons are the small sample size (i.e., luck) and spatial correlation across rows of the data.
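
To make that comparison concrete, you can pull the cross-validated estimate of the selected model straight out of the caret object and set it against the test RMSE (a minimal sketch reusing the model and test objects from the question):

getTrainPerf(model)  # resampled RMSE/Rsquared/MAE of the selected mtry = 7 model (~3.74)
sqrt(mean((predict(model, newdata = test) - test$medv)^2))  # test RMSE (~2.65)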

Especially for random forests, it does not make much sense to compare in-sample performance (RMSE 1.5) with validation/test performance, because a random forest fits the training data very greedily. Instead of looking at in-sample performance, you could consider the out-of-bag (OOB) performance, an implicit approximation of the true performance. Since you are working with the meta-package caret, this information might not be directly available when you optimize with the usual cross-validation, though.
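
Here is a minimal sketch of getting the OOB error, assuming you refit with ranger directly using the tuning values caret selected; for regression, the prediction.error slot holds the OOB mean squared error:

library(ranger)
set.seed(2)
fit <- ranger(medv ~ ., data = train,
              mtry = 7, splitrule = "variance", min.node.size = 5)
sqrt(fit$prediction.error)  # OOB RMSE, an implicit estimate of out-of-sample error

Alternatively, caret itself accepts trainControl(method = "oob") for random-forest-type models, which tunes on the OOB error instead of cross-validation.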
