Decision tree: handling an attribute with many nominal values

I would like to build a decision tree from training data. One attribute has many nominal values; for example, the department name attribute has about 20-30 values. I would like to merge these values into 4-5 groups, or whatever number is appropriate. How can I do this (preferably in Weka)? Should I do it in the preprocessing phase, the learning phase, or both?

That is an interesting question. Virtually all tree-growing methods (to mention just three classics: CHAID, CRT, QUEST) merge those categories of a predictor that behave similarly as predictors of the dependent variable (DV). So the analysis itself gives you the optimal binning you are asking for.

However, in CHAID and CRT (= CART) this category merging happens at each branch level at the same time as the selection of the best predictor variable for that level. As a result, variables that start with more categories have an unfair advantage: they are more likely to be selected as the best predictor purely by chance, because many categories produce more possible merge combinations to test.
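To give a feel for how quickly the number of candidate groupings grows, here is a small sketch that counts the distinct binary splits available for a nominal predictor with k categories. It uses the standard count of ways to divide k unordered categories into two non-empty groups, 2^(k-1) - 1, which applies to binary (CRT-style) splitting; the function name is only illustrative.

```python
def count_binary_splits(k: int) -> int:
    """Ways to divide k unordered categories into two non-empty groups."""
    if k < 2:
        return 0  # a single category cannot be split
    return 2 ** (k - 1) - 1


# Compare a small predictor with one the size of the department attribute.
for k in (2, 5, 20, 30):
    print(k, count_binary_splits(k))
```

A predictor with 20-30 categories offers hundreds of thousands to hundreds of millions of candidate splits, so one of them fitting the training data well by chance is much more likely than for a predictor with a handful of categories.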

QUEST is protected against that bias and is fast. But QUEST has its own limitations: the DV must be nominal, ordinal IVs are usually not supported, and only binary (dichotomous) splits are available.

So, if QUEST doesn't suit you, you can run CHAID or CRT first as a preprocessing step to merge the categories optimally. Run a separate analysis for each IV whose categories you want to merge, with that IV either as the only predictor or forced in as the first-level predictor. Once the categories are merged, proceed to the main analysis and build the tree with all the IVs.
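The idea of merging one predictor's categories as a preprocessing step can be sketched roughly as follows. This is a simplified, CHAID-like illustration and not the exact procedure of any particular package: it repeatedly merges the two groups whose class distributions differ least (smallest Pearson chi-square statistic) and stops when every remaining pair differs by more than a threshold. Real CHAID works with adjusted p-values rather than a fixed cutoff. The function names, the department counts, and the threshold of 3.841 (the 5% critical value for 1 degree of freedom, so it assumes a two-class DV) are all assumptions made for the example.

```python
from itertools import combinations


def chi_square(row_a, row_b):
    """Pearson chi-square statistic for a 2 x C table of class counts."""
    total = sum(row_a) + sum(row_b)
    stat = 0.0
    for row in (row_a, row_b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            col_total = row_a[j] + row_b[j]
            expected = row_total * col_total / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_categories(counts, threshold=3.841):
    """Greedily merge categories with similar class distributions.

    counts maps category name -> list of class counts.
    Returns a dict mapping tuples of merged names -> summed class counts.
    """
    groups = {(name,): list(c) for name, c in counts.items()}
    while len(groups) > 1:
        # Find the pair of groups that is least distinguishable.
        (a, b), stat = min(
            (((a, b), chi_square(groups[a], groups[b]))
             for a, b in combinations(groups, 2)),
            key=lambda item: item[1],
        )
        if stat > threshold:
            break  # every remaining pair differs enough; stop merging
        merged = [x + y for x, y in zip(groups[a], groups[b])]
        del groups[a], groups[b]
        groups[a + b] = merged
    return groups


# Made-up counts of [positive, negative] cases per department.
example = {
    "Sales": [30, 70],
    "Marketing": [32, 68],
    "IT": [70, 30],
    "R&D": [68, 32],
    "HR": [50, 50],
}
groups = merge_categories(example)
for names, class_counts in groups.items():
    print(names, class_counts)
```

With these made-up counts, Sales and Marketing end up in one group, IT and R&D in another, and HR stays on its own. The merged groups can then replace the original attribute values before the main tree is grown.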
