Solved – categorical_encoding in h2o – what is the difference between the options

I'm trying to understand the pros/cons and when to use the various encoding options that are available to me in h2o with the parameter 'categorical_encoding'.

It would be helpful if people could point out general rules of thumb on how to use this.

Typically I use the 'Enum' value because I like how all categorical values are grouped together when looking at feature importance. On the other hand, xgboost's default value is 'label-encoder' I believe, which breaks things up by categorical level/value.

Unfortunately, I don't really know where to begin, or what questions to ask, about the other available values:

  • one_hot_internal
  • one_hot_explicit
  • sort_by_response
  • enum_limited
  • enum
  • label_encoder

Again, I primarily stick with enum, sometimes label_encoder, but honestly I don't know the practical implications of these various options. I would love a general understanding of when one might be better than another, from someone knowledgeable!

The basic rule of thumb is: don't touch it. Keep the default.

The next guideline would be: if you try something different to the default, do it within a grid search, so you can quantify the effect it has.
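As a rough sketch of that guideline, the snippet below puts categorical_encoding into an H2O grid search so each encoding gets its own cross-validated score. It assumes a classification problem (hence sorting by logloss), a GBM as the model, and an H2OFrame that is already loaded; the helper name compare_encodings and the particular list of encodings are my own choices for illustration, not anything prescribed by H2O.

```python
# Candidate encodings to compare; "auto" is the default the advice above says to keep.
ENCODINGS = ["auto", "enum", "one_hot_explicit", "sort_by_response", "label_encoder"]


def compare_encodings(train, predictors, response, encodings=ENCODINGS):
    """Train one GBM per encoding; return the grid sorted by cross-validated logloss.

    Hypothetical helper: `train` is assumed to be an H2OFrame whose response
    column is categorical, on an H2O cluster that is already running.
    """
    # Imported lazily so this file can be loaded where h2o is not installed.
    from h2o.estimators import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    grid = H2OGridSearch(
        # Settings given here stay fixed across every model in the grid.
        model=H2OGradientBoostingEstimator(nfolds=5, seed=1),
        # The only thing that varies between models is the encoding.
        hyper_params={"categorical_encoding": list(encodings)},
    )
    grid.train(x=predictors, y=response, training_frame=train)
    # Lower logloss is better, so sort ascending.
    return grid.get_grid(sort_by="logloss", decreasing=False)
```

You would call it after h2o.init() and loading your data, for example compare_encodings(train, predictors, "label"), and then read off whether any encoding beats "auto" by more than the noise between folds.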

The documentation seems fairly good on discussing the differences, in regard to each algorithm: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html

label_encoder is unlikely to be good, as it assigns the underlying arbitrary numeric value to the category, and treats it as a number.

So for favourite colour, blue might be 1, red might be 2, orange might be 3, and pink might be 4. The deep-learning algorithm (or whatever) will then be thinking that red is twice blue, pink is twice red, and orange is three times blue. So, if it learns anything, it is learning nonsense.
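To make that concrete, here is a small plain-Python illustration (no H2O involved) using the same made-up colour numbers. It is only a sketch of the idea: the integer codes are the arbitrary ones from the paragraph above, and one_hot and squared_distance are hypothetical helpers written for this example.

```python
colours = ["blue", "red", "orange", "pink"]

# Label encoding: each level becomes an arbitrary integer (blue=1, red=2, ...).
label = {colour: code for code, colour in enumerate(colours, start=1)}


def one_hot(colour):
    """One-hot encoding: one 0/1 column per level, exactly one of them set."""
    return [1 if c == colour else 0 for c in colours]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# With label encoding, the model sees meaningless arithmetic relationships.
red_vs_blue = label["red"] / label["blue"]      # "red is twice blue"
pink_vs_red = label["pink"] / label["red"]      # "pink is twice red"

# With one-hot encoding, every pair of distinct colours is equally far apart,
# so no ordering or ratio is implied.
distances = {
    squared_distance(one_hot(a), one_hot(b))
    for a in colours
    for b in colours
    if a != b
}
```

The set distances ends up with a single value, which is the point: one-hot encoding treats the colours as merely different, while the label codes invent an ordering.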

The main reason to choose it over one-hot encoding would be if you have so many categorical levels that one-hot encoding would create hundreds or thousands of extra inputs.

(Personally, I'd either throw away such a column, or do some domain-specific data-engineering on it to reduce the levels (*), rather than use label_encoder)

*: an example is U.S. zip codes: reduce the thousands of possibilities by using a lookup table to turn them into a U.S. state; or substitute in some demographic information about that zip code.
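A minimal sketch of that zip-code idea, in plain Python: the lookup table here is a tiny hand-written sample for illustration only (a real one would come from a full zip-code dataset), and zip_to_state is a hypothetical helper name.

```python
# Tiny illustrative lookup table; a real table would cover every zip code.
ZIP_TO_STATE = {
    "90210": "CA",
    "10001": "NY",
    "60601": "IL",
    "02139": "MA",
}


def zip_to_state(zip_code, lookup=ZIP_TO_STATE, unknown="UNKNOWN"):
    """Map a zip code to its state, collapsing thousands of levels into about 50.

    Zip codes are handled as strings so leading zeros (e.g. "02139") survive,
    and ZIP+4 values such as "90210-1234" are cut back to the first five digits.
    """
    key = str(zip_code).strip()[:5]
    return lookup.get(key, unknown)


# Replace the high-cardinality column with the reduced one before modelling.
raw_zips = ["90210", "10001", "02139", "90210-1234", "99999"]
states = [zip_to_state(z) for z in raw_zips]
```

The resulting column has few enough levels that enum or one-hot encoding is no longer a problem, and unseen zip codes fall into a single "UNKNOWN" level instead of each becoming its own category.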
