In *Applied Predictive Modeling*, Kuhn and Johnson write:

> Finally, these trees suffer from selection bias: predictors with a higher number of distinct values are favored over more granular predictors (Loh and Shih, 1997; Carolin et al., 2007; Loh, 2010). Loh and Shih (1997) remarked that "The danger occurs when a data set consists of a mix of informative and noise variables, and the noise variables have many more splits than the informative variables. Then there is a high probability that the noise variables will be chosen to split the top nodes of the tree. Pruning will produce either a tree with misleading structure or no tree at all."

Kuhn, Max; Johnson, Kjell (2013-05-17). *Applied Predictive Modeling* (Kindle Locations 5241-5247). Springer New York. Kindle Edition.

They go on to describe some research into building unbiased trees, for example Loh's GUIDE model.

Staying as strictly as possible within the CART framework, I'm wondering if there's anything I can do to minimize this selection bias. For example, perhaps clustering/grouping high-cardinality predictors is one strategy. But to what degree should one group? If I have a predictor with 30 levels, should I group it down to 10 levels? 15? 5?
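To make the bias concrete, here is a rough simulation sketch. It uses scikit-learn's `DecisionTreeClassifier` (a CART-style implementation, standing in for `rpart` since this thread is about R) on pure noise: both predictors are independent of the response, but one is binary (a single candidate split) while the other is continuous (roughly n-1 candidate splits). The names and sample sizes below are illustrative choices, not anything from the book.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def root_split_feature(n=100):
    # Both predictors are pure noise with respect to y.
    y = rng.integers(0, 2, n)
    X = np.column_stack([
        rng.integers(0, 2, n).astype(float),  # binary noise: 1 candidate split
        rng.uniform(size=n),                  # continuous noise: ~n-1 candidate splits
    ])
    tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
    return tree.tree_.feature[0]  # index of the feature used at the root

# Over many replicates, the exhaustive-search splitter picks the
# high-cardinality noise variable at the root the vast majority of the time.
picks = [root_split_feature() for _ in range(200)]
frac_continuous = float(np.mean(np.array(picks) == 1))
print(f"continuous noise variable chosen at the root: {frac_continuous:.0%}")
```

Neither variable carries any signal, so an unbiased selector would pick each about half the time; the skew you see is the selection bias the quote describes.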


#### Best Answer

Based on your comment I'd go with a conditional inference framework. The code is readily available in R via the ctree function in the party package. It has unbiased variable selection, and while the algorithm that decides when and how to split differs from CART's, the logic is essentially the same. Another benefit outlined by the authors (see the paper here) is that you don't have to worry so much about pruning the tree to avoid overfitting: the algorithm takes care of that by using permutation tests to determine whether a split is "statistically significant" or not.
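The key idea, variable selection decoupled from split search via a significance test, can be sketched in a few lines. This is a toy illustration only, not party's actual ctree implementation (which works in a conditional inference framework with properly derived null distributions); the statistic, `alpha`, and the data below are all illustrative assumptions. Because each predictor's statistic is compared against its own permutation distribution, having more distinct values can't inflate a predictor's apparent importance, and "nothing significant" doubles as a stopping rule, which is why pruning matters less.

```python
import numpy as np

rng = np.random.default_rng(1)

def perm_pvalue(x, y, n_perm=999):
    """Permutation p-value for association between x and y, using
    |correlation| as a simple test statistic. Each predictor is judged
    against its own permutation null, so cardinality confers no advantage."""
    obs = abs(np.corrcoef(x, y)[0, 1])
    perm = np.array([abs(np.corrcoef(rng.permutation(x), y)[0, 1])
                     for _ in range(n_perm)])
    return (1 + np.sum(perm >= obs)) / (1 + n_perm)

def select_split_variable(X, y, alpha=0.05):
    """Stage 1 of a ctree-style split: pick the predictor with the smallest
    permutation p-value; return None if nothing is significant (the stopping
    rule). Only after this choice would one search for the best cutpoint."""
    pvals = [perm_pvalue(X[:, j], y) for j in range(X.shape[1])]
    j = int(np.argmin(pvals))
    return (j if pvals[j] < alpha else None), pvals

# Toy data: one informative binary predictor, one high-cardinality noise predictor.
n = 200
x_info = rng.integers(0, 2, n).astype(float)
x_noise = rng.uniform(size=n)
y = x_info + rng.normal(scale=0.5, size=n)  # y depends only on x_info
X = np.column_stack([x_info, x_noise])

chosen, pvals = select_split_variable(X, y)
print("chosen predictor:", chosen, "p-values:", np.round(pvals, 3))
```

Here the binary informative predictor wins despite the noise predictor offering far more candidate cutpoints, which is exactly the behavior the exhaustive CART search fails to guarantee.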

### Similar Posts:

- Solved – Which Machine Learning book to choose (APM, MLAP or ISL)
- Solved – Feature selection in GBM
- Solved – Combining ordinal and categorical (one-hot encoded) variables in one model