Solved – Optimal binning methods for categorical variables

I'm running a multinomial logit to predict the outcome of a categorical response variable.
I have both continuous and categorical independent variables, and I know it's bad practice to bin the continuous ones.
For the categorical ones, however, it's common practice (and it makes sense, especially since I have a lot of observations) to make dummies out of n-1 of the categories. I also don't have too many categories in each categorical variable (fewer than 20), and not many categorical variables in total.
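For concreteness, the n-1 dummy coding I mean is the standard one, e.g. with pandas (the column name `colour` is just a made-up example):

```python
import pandas as pd

# Toy data: one categorical predictor with three levels.
df = pd.DataFrame({"colour": ["red", "green", "blue", "green", "red"]})

# drop_first=True keeps n-1 dummies; the first level (alphabetically,
# 'blue') becomes the reference category absorbed by the intercept.
dummies = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)
print(dummies.columns.tolist())  # ['colour_green', 'colour_red']
```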

But I was wondering which is better: simply making dummies for the n-1 categories, or creating bins using a binning method such as IV, WoE, chi-square, or KS?
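To be clear about what I mean by a binning method, here is a minimal sketch of Weight of Evidence, which scores categories so that levels with similar WoE can be merged into one bin. Note WoE is defined for a binary event, so for my multinomial target it would have to be applied one-vs-rest per class (the function name and inputs here are my own illustration):

```python
import numpy as np
import pandas as pd

def weight_of_evidence(feature, is_event):
    """WoE per category: log of (share of events) / (share of non-events)
    falling in that category. Categories with similar WoE are candidates
    for merging into a single bin."""
    df = pd.DataFrame({"f": feature, "y": is_event})
    events = df.groupby("f")["y"].sum()
    non_events = df.groupby("f")["y"].count() - events
    pct_event = events / events.sum()
    pct_non = non_events / non_events.sum()
    return np.log(pct_event / pct_non)

woe = weight_of_evidence(["a", "a", "a", "b", "b", "b"],
                         [1, 1, 0, 1, 0, 0])
```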

Intuitively I feel that creating dummies would be better, since you capture precisely the effect of each category, whereas with bins you always lose a bit of predictive power by merging categories together.

Note there's already a number of good questions on this topic, like this one or this one.

In those, some alternatives to one-hot encoding (a.k.a. dummy variables) are mentioned. One is to use random effects to allow for partial pooling: especially rare classes get pulled towards being an average category, while large classes that have a lot of data showing different behaviour from the other classes end up being "allowed to be different".
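A full random-effects model needs dedicated software (e.g. lme4 or a Bayesian package), but the shrinkage behaviour itself can be sketched in a few lines. In this illustration the pseudo-count `k` is an assumed hyperparameter, not something from the posts above:

```python
import numpy as np

def pooled_rate(successes, counts, k=10.0):
    """Shrink each category's observed rate towards the global rate.

    Categories with few observations are pulled strongly towards the
    global mean; large categories mostly keep their own estimate.
    `k` acts like a pseudo-count controlling the pooling strength.
    """
    successes = np.asarray(successes, dtype=float)
    counts = np.asarray(counts, dtype=float)
    global_rate = successes.sum() / counts.sum()
    weight = counts / (counts + k)
    return weight * (successes / counts) + (1 - weight) * global_rate

# A rare category (2 obs, both successes) vs a common one (200 obs):
rates = pooled_rate([2, 100], [2, 200])
```

Here the rare category's raw rate of 1.0 is pulled most of the way back towards the global rate of roughly 0.5, while the common category barely moves.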

Another option mentioned in one of the answers is target encoding, which is also very popular on Kaggle. To adapt it for multinomial logistic regression, you could, in the regression equation for each target category, include a covariate that target-encodes that category: a covariate capturing the proportion of cases in which each class level resulted in that target category, shrunk towards the overall proportion (and possibly even shrinking each target category's proportion towards an equal proportion across all target categories). The amount of shrinkage is a tunable hyperparameter.
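A minimal sketch of that shrunk encoding for a single target category (the function name and the shrinkage parameter `m` are my own labels; in a multinomial setting you would build one such covariate per target category):

```python
import pandas as pd

def target_encode(feature, in_target_cat, m=20.0):
    """Shrunk target encoding for one target category.

    For each level of `feature`, compute the proportion of rows falling
    in the target category, shrunk towards the overall proportion.
    `m` is the tunable shrinkage hyperparameter: larger m means more
    shrinkage towards the overall proportion.
    """
    df = pd.DataFrame({"f": feature, "y": in_target_cat})
    overall = df["y"].mean()
    stats = df.groupby("f")["y"].agg(["sum", "count"])
    encoded = (stats["sum"] + m * overall) / (stats["count"] + m)
    return df["f"].map(encoded)

# Toy example: levels 'a'/'b', indicator of one target category.
enc = target_encode(["a", "a", "a", "b", "b"], [1, 1, 0, 0, 0], m=2.0)
```

In practice you would compute these encodings out-of-fold (fit on training folds, apply to the held-out fold) to avoid target leakage.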

If there's some sensible way of doing it, you could also consider finding a suitable embedding for the feature levels. E.g. you can create one via a pre-training task (much as one creates Word2Vec/GloVe etc. word embeddings), or project known properties of the classes into a low-dimensional space using a technique like UMAP or t-SNE. However, both of those options require some extra information and might not always be possible. If you train a neural network on tabular data, you can of course create (or fine-tune) such embeddings while training on your target task – or train on that task first and then use the embeddings in another model.
