I use caret to train a model on the Boston housing data (the `BostonHousing` dataset from the mlbench package). Here is the code:
```r
library(caret)
library(mlbench)
data(BostonHousing)
Boston <- BostonHousing  # mlbench names the dataset BostonHousing

set.seed(2)
ind   <- sample(nrow(Boston), trunc(0.7 * nrow(Boston)))
train <- Boston[ind, ]
test  <- Boston[-ind, ]

# Fit a random forest (ranger) using 5 x 5-fold repeated CV
model <- train(
  medv ~ ., train,
  method = "ranger",
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = FALSE
  )
)
```
When I print the model, it seems I have a good one:
```
Random Forest

354 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 282, 282, 285, 283, 284, 283, ...
Resampling results across tuning parameters:

  mtry  splitrule   RMSE      Rsquared   MAE
   2    variance    4.172443  0.8113023  2.702026
   2    extratrees  4.574969  0.7819608  2.946490
   7    variance    3.744418  0.8324785  2.475156
   7    extratrees  3.812538  0.8342013  2.478945
  13    variance    3.821406  0.8214275  2.517686
  13    extratrees  3.795269  0.8282988  2.465104

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 7, splitrule = variance
and min.node.size = 5.
```
When plotting the model I get the resampling profile (cross-validated RMSE against the tuning parameters; figure not shown here).
However, when I calculate the RMSE myself:

```r
sqrt(mean((predict(model) - train$medv)^2))                 # 1.487133
sqrt(mean((predict(model, newdata = test) - test$medv)^2))  # 2.648461
```
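As a side note, the same numbers can be computed with caret's built-in `RMSE(pred, obs)` helper; a minimal sketch, assuming the `model`, `train`, and `test` objects from the code above:

```r
# Same computations via caret's RMSE() helper (predictions first, then observed)
RMSE(predict(model), train$medv)                 # in-sample RMSE
RMSE(predict(model, newdata = test), test$medv)  # hold-out test RMSE
```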
I would like to know what I did wrong, and what I have to do in order to improve the model. Thank you.
Best Answer
You didn't do anything wrong.
The relevant comparison is the test RMSE (2.6) versus the one obtained from cross-validation (3.8). So your model actually does better on the hold-out test data than cross-validation suggested. Possible reasons are the small sample size (i.e. luck) and spatial correlation across rows: the Boston rows are census tracts, so neighbouring rows tend to be similar, and a random split can put near-duplicate tracts into both train and test.
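For that comparison, the cross-validated RMSE of the selected tuning parameters can be read off the fitted object; a minimal sketch, assuming the `model` object fitted above:

```r
# Cross-validated performance of the winning tuning combination:
# model$bestTune holds the selected parameters, model$results the CV metrics.
cv_rmse <- merge(model$bestTune, model$results)$RMSE
cv_rmse  # about 3.74 here, the value reported for mtry = 7 / variance
```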
Especially for random forests, it does not make much sense to compare in-sample performance (RMSE 1.5) with validation/test performance, because a random forest fits its training data very aggressively, so the in-sample error is always optimistically low. Instead of looking at in-sample performance, for random forests you can consider the out-of-bag (OOB) performance, an implicit approximation of the true generalization error. Since you are working through the meta-package caret, though, this information is not surfaced directly when you tune with ordinary cross-validation.
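If you want the OOB number anyway, the underlying ranger fit is stored inside the caret object; a minimal sketch, assuming the `model` from above (for regression, ranger reports the out-of-bag mean squared error as `prediction.error`):

```r
# The final ranger fit lives in model$finalModel; for regression,
# prediction.error is the out-of-bag MSE, so its square root is an OOB RMSE.
oob_rmse <- sqrt(model$finalModel$prediction.error)
oob_rmse
```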