Solved – dummy vs one-hot encoding – ML for prediction

I understand there is a lack of consensus on the difference (if any) between one-hot (k variables) and dummy (k – 1 variables) encoding of a k-level factor.

From my limited use of it so far, the caret package seems to encode factors automatically. glmnet, in contrast, doesn’t, and one needs to run the model.matrix function to do this in pre-processing.

Can someone help clarify whether it's more appropriate to use dummy (with or without intercept) or one-hot encoding in ML algorithms where prediction is the main priority?

```
# Consider a 3-level factor 'dat$thal'
```

Dummy encoding – option 1

```
# Here we remove the intercept with [,-1]
> x <- model.matrix( ~ ., dat)[,-1]
> x <- data.frame(x)
> str(x)
'data.frame':   180 obs. of  2 variables:
 $ thal2: num  0 0 0 0 0 0 0 1 0 0 ...
 $ thal3: num  0 0 0 1 1 0 1 0 1 0 ...
```

Dummy encoding – option 2

```
# Here we leave the intercept
> x <- model.matrix( ~ ., dat)
> x <- data.frame(x)
> str(x)
'data.frame':   180 obs. of  3 variables:
 $ X.Intercept.: num  1 1 1 1 1 1 1 1 1 1 ...
 $ thal2       : num  0 0 0 0 0 0 0 1 0 0 ...
 $ thal3       : num  0 0 0 1 1 0 1 0 1 0 ...
```

One-hot encoding – option 3

```
# Here we encode all levels (no intercept)
> x <- model.matrix( ~ .+0, dat)
> x <- data.frame(x)
> str(x)
'data.frame':   180 obs. of  3 variables:
 $ thal1: num  1 1 1 0 0 1 0 0 0 1 ...
 $ thal2: num  0 0 0 0 0 0 0 1 0 0 ...
 $ thal3: num  0 0 0 1 1 0 1 0 1 0 ...
```

Firstly, you should not add explanatory vectors that are linear combinations of explanatory vectors already in the model. This leads to identifiability problems, and at best these will be handled by the algorithm ignoring one of your inputs. Thus, when working with a factor variable with $$k$$ categories, you would either use $$k$$ indicators and no intercept term, or use $$k-1$$ indicators with an intercept term. There are some advantages and disadvantages to both methods, depending on what you want to do.
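The identifiability problem can be seen directly in a small sketch. The data here are simulated purely for illustration (the data frame `dat`, the factor `thal` and the response `y` are made up and are not the asker's dataset). With all three indicators plus an intercept column, the design matrix loses a rank, and `lm` copes by reporting `NA` for one coefficient.

```r
set.seed(1)

# Hypothetical data: a 3-level factor and a numeric response
dat <- data.frame(thal = factor(sample(1:3, 180, replace = TRUE)))
dat$y <- rnorm(180, mean = c(10, 12, 15)[as.integer(dat$thal)])

# All k = 3 indicators, no intercept column
onehot <- model.matrix(~ thal + 0, dat)

# Add an intercept column on top of the k indicators
x_full <- cbind(intercept = 1, onehot)

# The three indicator columns sum to the intercept column,
# so the matrix has 4 columns but only rank 3
rank_full <- qr(x_full)$rank

# lm() drops one aliased column; its coefficient is reported as NA
fit_full <- lm(dat$y ~ x_full + 0)
n_dropped <- sum(is.na(coef(fit_full)))
```

Other fitting routines may deal with the redundant column differently, so this only shows what `lm` does.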

Using $$k$$ indicators and no intercept term: With this method the coefficients corresponding to each of the indicators in your model are interpreted as absolute effects, and are not relative to any base category. This can be useful if you want to plot all of the estimated coefficients for the categories, and you don't want any of the estimated effects to be forced to a baseline of zero. Sometimes you want to see the estimated "total effect" for a particular category with its associated confidence interval, and this is easiest to obtain if you fit the model with this method. (You can still get it from the other method, but it requires some mucking around.)

The downside of this method is that you have to be very careful when looking at ANOVA outputs and other outputs that compare your model to a null model. Since your specified model has no intercept term, these outputs will generally compare your model to a null model with no intercept term, so that the null model is effectively just white noise. This means that your ANOVA outputs and other similar outputs give comparisons to a very poor model, and the success of your model will appear overstated.

As an example, suppose you build a regression model of height using sex as the explanatory variable, and you code it so that there is no intercept, but there are two indicators, for males and females. In this case, plotting your estimated coefficients for the two categories has a simple and natural interpretation — they each represent estimates of the mean height of that sex. This is a nice aspect of the method. However, if you look at the ANOVA output for the model, it will be made against a null model that postulates heights of all people as white noise (with zero mean) rather than as having a non-zero mean but with no sex difference. This means that your ANOVA outputs will be very misleading.
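A minimal sketch of this example in R, using simulated heights (the group means of 178 and 165 cm and the standard deviation of 7 are arbitrary illustrative values). It uses the R-squared reported by `summary` as one instance of an output that is computed against the null model.

```r
set.seed(42)

# Simulated heights in cm; the values are for illustration only
sex <- factor(rep(c("female", "male"), each = 100))
height <- rnorm(200, mean = ifelse(sex == "male", 178, 165), sd = 7)

# k = 2 indicators and no intercept: each coefficient is a group mean
fit_k <- lm(height ~ sex + 0)
coef(fit_k)
tapply(height, sex, mean)  # the same two numbers

# Without an intercept, R-squared is measured against a zero-mean null model,
# so it comes out close to 1 regardless of how much sex actually explains
r2_no_intercept <- summary(fit_k)$r.squared

# With an intercept, the same fit is measured against the grand-mean model
r2_intercept <- summary(lm(height ~ sex))$r.squared
```

Both models produce the same fitted values. Only the reference model behind the summary statistic changes.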

Using $$k-1$$ indicators plus an intercept term: This is the opposite case, so the advantages and disadvantages are reversed. Under this method, the coefficients for categories of your factor variable represent relative effects, which are differences in effect size between the present category and the baseline category. This can be useful if you want to look at relative effects, but it means that you do not have direct access to the absolute effects, and these take a bit of mucking around to obtain.

The main upside of this method is that ANOVA output and other model-comparisons will compare your model to a null model with an intercept term, which is usually the comparison you want to make in these cases. This means that the outputs of these model comparisons will show the success or failure of your model against a baseline model that does not assume a zero mean for the response variable.

Continuing the above example, suppose you now build a regression model of height using sex as the explanatory variable, and you code it so that there is an intercept, and then an indicator for females (with males as the baseline category). In this case, plotting your estimated coefficients has a less simple and natural interpretation: you get an estimate of the mean height of males (the baseline category), and an estimate of the mean difference in height between females and males. This is probably not the ideal presentation of that information, so this is a sub-optimal aspect of this method. On the other hand, if you look at the ANOVA output for the model, it will be made against a null model that allows a non-zero mean height but with no sex difference. This means that your ANOVA outputs will give a useful comparison to a simple base model.
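The intercept version of the example can be sketched the same way, again with simulated heights (all numeric values are illustrative assumptions). The intercept is the mean of the baseline category, and the absolute effect for the other category is recovered by adding its coefficient to the intercept.

```r
set.seed(42)

# Simulated heights in cm; males are set as the baseline level
sex <- factor(rep(c("female", "male"), each = 100),
              levels = c("male", "female"))
height <- rnorm(200, mean = ifelse(sex == "male", 178, 165), sd = 7)

# Intercept plus k - 1 = 1 indicator (for females)
fit_int <- lm(height ~ sex)
b <- coef(fit_int)

# Recover the absolute effects from the relative coding
male_mean   <- unname(b["(Intercept)"])
female_mean <- unname(b["(Intercept)"] + b["sexfemale"])

# Model comparison against a null model that allows a non-zero mean
fit_null <- lm(height ~ 1)
anova(fit_null, fit_int)

# For an ordinary least-squares fit like this one, the no-intercept
# coding gives the same fitted values
fit_k <- lm(height ~ sex + 0)
same_fit <- isTRUE(all.equal(fitted(fit_int), fitted(fit_k)))
```

The last check applies to unpenalized `lm` fits only. It is not a claim about penalized or tree-based methods.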
