Solved – Random Forests for predictor importance (Matlab)

I'm working with a dataset of approximately 150,000 observations and 50 features, using an SVM for the final model. To trim down the feature count so that SVM optimization doesn't take too long, I decided to look into using a random forest (RF) for feature selection. I'm currently using the TreeBagger implementation in Matlab and have a few questions.

  1. When investigating feature importances, should the RF be tuned for
    the highest CV performance? Does the accuracy of the model play into
    the accuracy of the reported predictor importances?
  2. What is the best way to deal with one of two correlated features having its importance underreported? Can this be cancelled out by training the RF multiple times and averaging the feature rankings?
  3. There doesn't seem to be any way to manually select the split criterion
     in TreeBagger, nor could I find any documentation on what the
     default is. Does anyone know? If not, would it be safe to assume it's using Gini?
  4. How do the feature importances from TreeBagger compare to those
     generated by Matlab's fitensemble? fitensemble supports bagging and
     several boosting algorithms, as well as different split criteria.
     But as far as I know, those methods don't invoke Breiman's RF
     algorithm; the only Matlab function that does is TreeBagger, when a
     number of features to sample is specified. Please correct me if I'm
     wrong. As it stands, fitensemble looks more attractive thanks to its
     extra options and better documentation.
Answer:

  1. What you describe would be one approach. For classification, TreeBagger by default randomly selects sqrt(p) predictors for each decision split (the setting recommended by Breiman). Depending on your data and tree depth, some of your 50 predictors could be considered for splits less often than others just because they get unlucky. This is why, for estimating predictor importance, I usually set 'nvartosample' to 'all'. This gives a model with somewhat lower accuracy but ensures that every predictor is sensibly included.
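
A minimal sketch of that setup, assuming a predictor matrix X (150,000-by-50), a label vector Y, and an arbitrary choice of 200 trees:

    rng(1);  % for reproducibility
    % Consider every predictor at each split so none is under-sampled,
    % and compute out-of-bag permuted importance.
    B = TreeBagger(200, X, Y, ...
        'Method', 'classification', ...
        'NVarToSample', 'all', ...
        'OOBVarImp', 'on');
    imp = B.OOBPermutedVarDeltaError;     % one importance score per predictor
    [~, ranking] = sort(imp, 'descend'); % predictors ranked by importance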

  2. If you run TreeBagger at the default settings, this is generally not a problem. For example, if you have two strongly correlated features and one of them is included in the 7 predictors selected at random for a split, the other is likely not included in those 7. If you want to adopt my scheme of inspecting all predictors for each split, additionally use surrogate splits by setting 'surrogate' to 'all'. Training will take longer, but you will get full information about each predictor's importance irrespective of associations among predictors.
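
With surrogate splits added, the call from the sketch above becomes (same assumed X and Y):

    % Surrogate splits evaluate every predictor at every split, so the
    % importance estimates account for associations among predictors.
    B = TreeBagger(200, X, Y, ...
        'Method', 'classification', ...
        'NVarToSample', 'all', ...
        'Surrogate', 'all', ...
        'OOBVarImp', 'on');
    imp = B.OOBPermutedVarDeltaError;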

  3. The TreeBagger doc and help have this statement at the bottom:

"In addition to the optional arguments above, this method accepts all optional fitctree and fitrtree arguments with the exception of 'minparent'. Refer to the documentation for fitctree and fitrtree for more detail."

Look at the doc for fitctree and fitrtree.
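
To the Gini question specifically: fitctree's default SplitCriterion for classification is 'gdi' (Gini's diversity index), so yes. And because TreeBagger forwards fitctree arguments, the criterion can also be set explicitly; a minimal sketch with the same assumed X and Y:

    % Pass a fitctree argument through TreeBagger to change the criterion.
    B = TreeBagger(200, X, Y, ...
        'Method', 'classification', ...
        'SplitCriterion', 'deviance');  % alternatives: 'gdi' (default), 'twoing'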

  4. fitensemble with the 'Bag' method implements Breiman's random forest with the same default settings as TreeBagger. You can change the number of features to sample to whatever you like; just read the doc for templateTree. The object returned by fitensemble has a predictorImportance method, which shows the cumulative gains due to splits on each predictor. TreeBagger's equivalent of that is the DeltaCriterionDecisionSplit property (or something like that). In addition, TreeBagger has three OOBPermuted* properties that offer alternative measures of predictor importance.
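
A sketch of the fitensemble route, under the same assumed X and Y; sampling 7 predictors per split mirrors the sqrt(50) ≈ 7 default mentioned in point 1:

    t = templateTree('NVarToSample', 7);   % predictors sampled per split
    ens = fitensemble(X, Y, 'Bag', 200, t, 'Type', 'classification');
    imp = predictorImportance(ens);        % cumulative split-gain per predictor
    bar(imp); xlabel('Predictor index'); ylabel('Importance');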
