I have a random forest being applied to 7 different input variables to predict a particular classification. I've done a grid search on the hyperparameters `mtry` and `ntree`, and it seems the algorithm is most accurate when `mtry` is 6 (the highest value for `mtry` I allowed in my search). This finding was also confirmed on a test set. My intuition is that `mtry` should always be less than the total number of variables in the model, but I can't find anything that explicitly states this.

**My question** Is there an upper limit to `mtry`, as I think there should be? And if that's the case, what would it indicate if my model gets more accurate as I approach that upper limit? Is that something to be concerned about?
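For concreteness, here is a sketch of the kind of grid search described above, using scikit-learn rather than R's `randomForest` (scikit-learn's `max_features` plays the role of `mtry`, and `n_estimators` the role of `ntree`; the synthetic 7-variable data set is an assumption, not the original data):

```python
# Sketch: grid search over the mtry/ntree analogues in scikit-learn.
# Synthetic data stands in for the original 7-variable problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 7 input variables, binary classification
X, y = make_classification(n_samples=300, n_features=7, n_informative=5,
                           random_state=0)

param_grid = {
    "max_features": [1, 2, 3, 4, 5, 6],  # analogue of mtry
    "n_estimators": [100, 300],          # analogue of ntree
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Note that scikit-learn will also accept `max_features` equal to the total number of variables, so the upper bound in the grid is a modelling choice, not an API restriction.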


#### Best Answer

`mtry` is indeed bounded by the number of variables in your model, as it specifies the size of the variable subset that is randomly sampled as split candidates at each node of each tree in the random forest.

Given that it is a hyperparameter, there is no way to know ahead of time what the best value of `mtry` will be. However, values of `mtry` close to the total number of variables may weaken the forest by making the individual decision trees more correlated: when the trees consider similar sets of variables to split on, they are more likely to be similar, even though each is fit to a different bootstrapped data set. Ensemble models usually strive for independence among their members, as that improves predictive ability.

Despite these concerns, a value picked through proper validation, even if it is large, is probably a fair value to use in your model.