I'm trying to use a random forest for regression. However, it does not perform well even on the training set, not to mention the test set.
I'm now wondering whether this is caused by poor-quality input data, or whether I can improve something in my approach.
Here is my data and model:
- 2 continuous input variables
- 1 categorical input variable
- 1 continuous output variable
I am trying to predict environment-related data from financial data (so it is not guaranteed that there is actually a clean relationship in the data!).
Summary statistics of the variables:

```
Input 1:      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
                60     52154    366902   9754180   2342790 341465729
Input 2:      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
                21     14043     89800   2600502    561641 108610665
Input 3:      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
         5.938e+06 2.924e+09 7.511e+09 1.842e+10 2.198e+10 2.828e+11
Output:       Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
                 0      8032    282167   2638721   2048726  68796039
```
Model fitting code (caret package):
```r
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verbose = TRUE, savePredictions = TRUE)
fit <- train(x = train_parametres, y = train_result, data = analyse,
             method = "rf", trControl = control, importance = TRUE,
             allowParallel = TRUE, ntree = 2000)
```
Output (results for "fit" from cross-validation on training set):
```
Random Forest

324 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 292, 292, 292, 291, 292, 291, ...
Resampling results across tuning parameters:

  mtry  RMSE     Rsquared
  2     4983092  0.5596401
  3     5128162  0.5452369

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
```
Changing nodesize, ntree and mtry does not change the results much.
So is this a data-quality problem, or are there other ways to improve the model that I have overlooked, e.g. data normalization? To my knowledge, it should at least be possible to overfit the model and get better results on the training set.
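One way to test the "it should at least be possible to overfit" intuition is to bypass cross-validation, grow fully deep trees, and score the forest on the very data it was trained on. Here is a minimal sketch using the `randomForest` package directly; the data frame below is a synthetic stand-in (the real `train_parametres`/`train_result` would be substituted in):

```r
library(randomForest)

# Synthetic stand-in for the real data: 2 continuous predictors,
# 1 categorical predictor, strong signal in the outcome
set.seed(42)
n  <- 324
x1 <- runif(n)
x2 <- runif(n)
x3 <- factor(sample(letters[1:4], n, replace = TRUE))
y  <- 5 * x1 + 3 * x2 + rnorm(n, sd = 0.1)
train_df <- data.frame(x1, x2, x3, y)

# nodesize = 1 lets trees grow to full depth, maximizing the
# forest's ability to (over)fit the training set
fit <- randomForest(y ~ ., data = train_df, ntree = 500, nodesize = 1)

# Resubstitution R^2: predictions on the training data itself
# (optimistic by construction -- that is the point of this check)
pred     <- predict(fit, train_df)
r2_train <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
```

If `r2_train` stays low even on resubstituted training data, that points toward the predictors simply carrying little signal, rather than toward a tuning problem.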
Since you've already explored changing `mtry`, you're left with two possible explanations for the low R²:
- Most of the variation in your data is random, i.e. not explainable by your predictors.
- You have insufficient data.
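The second explanation can be probed with a crude learning curve: fit the forest on growing subsets of the training data and watch the out-of-bag error. This sketch uses synthetic data as a placeholder for the real variables:

```r
library(randomForest)

# Placeholder data with the same shape as the question's problem
set.seed(7)
n <- 324
X <- data.frame(x1 = rexp(n), x2 = rexp(n),
                x3 = factor(sample(1:5, n, replace = TRUE)))
y <- log1p(X$x1) + rnorm(n)

# OOB MSE of the final forest for increasing training-set sizes
sizes <- c(80, 160, 240, 320)
oob <- sapply(sizes, function(m) {
  idx <- sample(n, m)
  fit <- randomForest(x = X[idx, ], y = y[idx], ntree = 500)
  tail(fit$mse, 1)  # fit$mse holds OOB MSE after each tree
})
```

If the OOB error is still clearly falling at the largest subset size, collecting more data would likely help; if it has flattened out, more data probably won't move the R².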
Unfortunately, there's not a lot we can recommend based on the information you've presented. Random forests are structurally fairly robust to overfitting because of bagging (but see the side note below), so I wouldn't be surprised if you can't push that R² much higher.
Side note: at `mtry = 3`, you're using all your predictors at every split. Since you have only 3 predictors, that negates one of the mechanisms that make random forests work: 'feature bagging', also known as the 'random subspace method'. This is explained well in this answer.