I'm learning machine learning with Python's scikit-learn library, and in their tutorial they mention a categorical variable `color` which can take the values `purple`, `blue` and `red`.

What is the reason for using three boolean variables `color#purple`, `color#blue` and `color#red` instead of a single variable `color` that maps the values `purple`, `blue`, `red` to `1`, `2`, `3`?

Does either choice affect the regression fitting/prediction?
Best Answer
To elaborate on the answers of our colleagues above: say you map purple, blue, red to $x = 1, 2, 3$. Say $x$ represents the colour of a hat, and $y$ sales. Then if we have a model with an intercept, call it $a$ and the coefficient of $x$, call it $b$, then we'd be saying:
$y = a + b x$
We only get to choose one $b$ here, which has to cater for all the different colours. Imagine more blue hats are sold than purple hats, and also more blue than red: then our model captures the purple-blue relationship (the model predicts $a + 1b < a + 2b$), but not the blue-red relationship (it forces $a + 2b < a + 3b$, i.e. red above blue)!
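A minimal numerical sketch of this problem, using made-up sales figures in which blue hats sell best (so sales are not monotone in the arbitrary coding 1, 2, 3):

```python
import numpy as np

# Hypothetical data: purple=1, blue=2, red=3, with blue selling best.
x = np.array([1.0, 2.0, 3.0])      # integer-coded colour
y = np.array([50.0, 90.0, 60.0])   # sales per colour (made up)

# Fit y = a + b*x by least squares.
A = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

preds = a + b * x
# A single slope b forces the fitted values to be monotone in x,
# so the model cannot reproduce the peak at blue.
print(preds)
```

Whatever single $b$ the fit picks, the predictions move in one direction as $x$ goes 1, 2, 3, so the fitted value for blue can never sit above both purple and red.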
If we use dummy variables then we might have a model like:
$y = a + b_{\mathrm{red}} x_{\mathrm{red}} + b_{\mathrm{purp}} x_{\mathrm{purp}}$
And this doesn't run into the same ordering problems as the first model. Note that with an intercept we only need two dummy variables, since the intercept becomes the baseline level for blue.
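Continuing the same hypothetical sales figures with dummy (one-hot) coding, blue as the baseline absorbed by the intercept, and `x_red`, `x_purp` as 0/1 columns:

```python
import numpy as np

# Same hypothetical data: sales for purple, blue, red.
y = np.array([50.0, 90.0, 60.0])
x_red  = np.array([0.0, 0.0, 1.0])   # 1 only for the red row
x_purp = np.array([1.0, 0.0, 0.0])   # 1 only for the purple row

# Fit y = a + b_red*x_red + b_purp*x_purp by least squares.
A = np.column_stack([np.ones(3), x_red, x_purp])
(a, b_red, b_purp), *_ = np.linalg.lstsq(A, y, rcond=None)

# Each colour now gets its own level: a recovers the blue mean,
# while b_red and b_purp are offsets from blue, so the fit is exact.
print(a, b_red, b_purp)
```

With one coefficient per non-baseline colour there is no imposed ordering: the model is free to put blue above both red and purple, which the integer coding could not do.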