If I have a factor, e.g. sexe with two levels, MALE and FEMELLE, then using rpart alone I get splits that say, for example, Sexe = Male followed by a yes/no split. However, using rpart with caret I get a weird renaming of variables:
This also causes a problem with the predict function, as my variable is no longer called sexe but sexeMALE. Is there a way around this? Also, since it's a factor variable, what does >= .5 mean in this case?
Thanks
Best Answer
You probably used the formula method with train, which converts the factors to dummy variables. Most functions in R that use the formula method do the same. rpart, randomForest, naiveBayes, and a few others do not, since they are able to model the categories without needing numeric encodings of the data.
The naming that you see is what is generated by model.matrix. The dummy variable sexeMALE is 1 for MALE and 0 for FEMELLE, so a split at >= .5 simply separates the two levels.
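To see where a name like sexeMALE comes from, here is a minimal sketch using base R only. The data frame df and its age column are made up for illustration; only the sexe factor and its two levels come from the question.

```r
# Hypothetical toy data: only `sexe` and its levels come from the question.
df <- data.frame(
  sexe = factor(c("MALE", "FEMELLE", "MALE", "FEMELLE")),
  age  = c(30, 25, 41, 58)
)

# model.matrix() expands the factor into a 0/1 dummy column.
# The first level alphabetically (FEMELLE) becomes the reference level,
# so the only dummy column is named <variable><level> = "sexeMALE".
mm <- model.matrix(~ sexe + age, data = df)

colnames(mm)
# "(Intercept)" "sexeMALE" "age"

# The dummy is 1 for MALE rows and 0 for FEMELLE rows, so a tree split
# written as sexeMALE >= 0.5 is the same as sexe == "MALE".
cbind(sexe = as.character(df$sexe), sexeMALE = mm[, "sexeMALE"])
```

Because the column only takes the values 0 and 1, any threshold strictly between them would give the same split; 0.5 is just the midpoint.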
If you want to keep the factors as factors, use the non-formula method, e.g.
train(x, y)
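As a rough sketch of the non-formula interface, the example below assumes the caret and rpart packages are installed and uses simulated data. The names dat, age and outcome are hypothetical and not from the question.

```r
library(caret)

set.seed(1)
n <- 200

# Simulated data for illustration only.
dat <- data.frame(
  sexe = factor(sample(c("MALE", "FEMELLE"), n, replace = TRUE)),
  age  = runif(n, min = 18, max = 80)
)
dat$outcome <- factor(ifelse(dat$sexe == "MALE" & dat$age > 40, "yes", "no"))

# Non-formula interface: pass the predictors as a data frame and the
# outcome as a vector, so `sexe` is handed to rpart as a factor.
x <- dat[, c("sexe", "age")]
y <- dat$outcome

fit <- train(x = x, y = y, method = "rpart")

# Inspect the fitted tree to see how the splits on sexe are reported.
print(fit$finalModel)

# New data for predict() uses the original column names (sexe, age).
preds <- predict(fit, newdata = x[1:5, ])
```

With this interface, the data passed to predict should only need the original columns, with sexe as a factor having the same levels as in training.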
Max