# Why do categorical predictor variables in regression need to be recoded as multiple predictors?

I'm learning about machine learning using Python's scikit-learn library, and in their tutorial here they mention a categorical variable `color` that can take the values `purple`, `blue`, and `red`.

What is the reason for using 3 boolean variables `color#purple`, `color#blue`, and `color#red` instead of a single variable `color` that maps the values `purple`, `blue`, `red` to `1`, `2`, `3`?

Will doing it one way or the other have any effect on the regression fitting/prediction?


To elaborate on the other answers: say you map purple, blue, red to \$x = 1, 2, 3\$, where \$x\$ represents the colour of a hat and \$y\$ the sales. Then if we have a model with an intercept \$a\$ and a coefficient \$b\$ on \$x\$, we'd be saying:

\$y = a + b x\$

We only get to choose one \$b\$ here, and it has to serve all the colours at once. Suppose more blue hats are sold than purple hats, and also more blue than red. A positive \$b\$ captures the purple-blue relationship (\$1b < 2b\$), but it also forces \$2b < 3b\$, which predicts that red outsells blue and contradicts the data!

If we use dummy variables then we might have a model like:

\$y = a + b_{\mathrm{red}} x_{\mathrm{red}} + b_{\mathrm{purp}} x_{\mathrm{purp}}\$

And this doesn't run into the ordering problem of the first model. Note that we only need two dummy variables when there is an intercept, because the intercept acts as the baseline for blue.
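With the same made-up sales figures as before, the dummy-variable model fits the data exactly; a minimal sketch (blue is the all-zeros baseline row, absorbed by the intercept):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: x_red, x_purp. The blue row is all zeros (the baseline).
X = np.array([[0, 0],    # blue
              [0, 1],    # purple
              [1, 0]])   # red
y = np.array([30.0, 10.0, 10.0])   # same made-up sales as above

model = LinearRegression().fit(X, y)
print(model.intercept_)  # 30.0: baseline (blue) sales
print(model.coef_)       # [-20., -20.]: red and purple relative to blue
```

Each coefficient is just that colour's difference from the baseline, so any pattern of per-colour sales can be represented, with no ordering imposed.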
