Why do categorical predictor variables in regression need to be recoded as multiple predictors?

I'm learning about machine learning with Python's scikit-learn library, and in their tutorial here they mention a categorical variable color that can take the values purple, blue and red.

What is the reason for using three boolean variables, color#purple, color#blue and color#red, instead of a single variable color that maps the values purple, blue, red to 1, 2, 3?
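For concreteness, this boolean expansion is what one-hot encoding produces. A minimal sketch with a recent scikit-learn (note the encoder names its columns color_purple etc. rather than the tutorial's color# notation, and `sparse_output` assumes scikit-learn 1.2 or later):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of categorical colour values
colors = np.array([["purple"], ["blue"], ["red"]])

# Expand into one boolean column per colour (categories are sorted alphabetically)
enc = OneHotEncoder(sparse_output=False)
print(enc.fit_transform(colors))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
print(enc.get_feature_names_out(["color"]))
# ['color_blue' 'color_purple' 'color_red']
```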

Does the choice between them have any effect on the regression fitting/prediction?

To elaborate on the answers of our colleagues above: say you map purple, blue, red to $x = 1, 2, 3$, where $x$ represents the colour of a hat and $y$ represents sales. Then, with an intercept $a$ and a coefficient $b$ on $x$, the model says:

$y = a + b x$

We only get to choose one $b$ here, which has to cater for all the different colours at once. Because the predictions $a + b$, $a + 2b$, $a + 3b$ are forced to be ordered along $x$, the model can never rank blue above both purple and red: if more blue hats are sold than purple, the fit wants $a + b < a + 2b$ (so $b > 0$), but then it must also predict $a + 2b < a + 3b$, i.e. red outselling blue, even when the data say the opposite!
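To see this numerically, here is a minimal sketch (the sales figures are made up for illustration): with blue outselling both purple and red, a least-squares line through the integer codes cannot place blue on top.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Integer coding: purple=1, blue=2, red=3
x = np.array([[1], [2], [3]])
# Made-up average sales: blue (30) outsells purple (10) and red (15)
y = np.array([10.0, 30.0, 15.0])

model = LinearRegression().fit(x, y)
print(model.predict(x))
# ~[15.8, 18.3, 20.8] -- predictions are forced to be monotone in x,
# so the fit cannot rank blue above both purple and red
```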

If we use dummy variables then we might have a model like:

$y = a + b_{\mathrm{red}}\,x_{\mathrm{red}} + b_{\mathrm{purp}}\,x_{\mathrm{purp}}$

And this doesn't run into the ordering problem of the first model. Note that we only need two dummy variables when there is an intercept, since the intercept becomes the baseline for blue (the case $x_{\mathrm{red}} = x_{\mathrm{purp}} = 0$).
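Continuing the made-up numbers from above, a sketch of the dummy-variable fit: with blue as the baseline, the intercept recovers blue's mean and each coefficient is that colour's offset from blue.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two dummy columns [x_red, x_purp]; blue is the baseline [0, 0]
X = np.array([
    [0, 1],  # purple
    [0, 0],  # blue
    [1, 0],  # red
])
y = np.array([10.0, 30.0, 15.0])  # same made-up sales as before

model = LinearRegression().fit(X, y)
print(model.intercept_)  # 30.0 -> a, the baseline (blue) mean
print(model.coef_)       # [-15. -20.] -> b_red, b_purp: offsets from blue
print(model.predict(X))  # [10. 30. 15.] -> each colour fitted exactly, no ordering constraint
```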
