# Solved – Dummy coding vs. continuous variable in regression analysis

I am doing a regression analysis in R, in which I examine the contribution of each car attribute to its price.

Some variables can be coded either as a set of dummy variables or as a single continuous variable.
For example, I can add a dummy variable for each number of cylinders (2, 4, 6 or 8), or I can treat the number of cylinders as a continuous variable.

Is there a difference between the two possibilities?

Regressing price \$y\$ on a constant and the number of cylinders \$x\$ would make sense if the price were known to be affine in the number of cylinders, that is, if the price increase from 2 to 4 cylinders is the same as the increase from 4 to 6 and the same as the increase from 6 to 8. Then you could run the regression:

\$\$ y_i = a + b x_i + \epsilon_i \$\$

On the other hand, it may not be affine in reality. If price isn't affine in number of cylinders, the above model would be misspecified.
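
To make the restriction concrete, here is a minimal R sketch of the continuous specification. The data frame `car_data` and its prices are invented purely for illustration (they are not from the question), and the object names are my own.

```r
# Hypothetical toy data: the prices are made up for illustration only.
car_data <- data.frame(
  cyl   = c(2, 2, 4, 4, 6, 6, 8, 8),
  price = c(10, 12, 15, 17, 18, 20, 35, 37)
)

# Treat the number of cylinders as a continuous regressor:
# one intercept a and one slope b.
fit_linear <- lm(price ~ cyl, data = car_data)
print(coef(fit_linear))

# Fitted prices at 2, 4, 6 and 8 cylinders.
fitted_by_cyl <- predict(fit_linear, newdata = data.frame(cyl = c(2, 4, 6, 8)))

# Every step of 2 cylinders changes the fitted price by the same amount (2 * b),
# whatever the data actually look like.
print(diff(fitted_by_cyl))
```

The three differences printed at the end are identical by construction. That is the affine assumption at work, and it holds even though the invented prices jump much more between 6 and 8 cylinders than between 4 and 6.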

What could one do? Let \$z_2\$ be a dummy variable for 2 cylinders, let \$z_4\$ be a dummy variable for 4 cylinders, and so on. Since there are only four possibilities (2, 4, 6 or 8 cylinders), you likely have enough data to run the more flexible regression:

\$\$ y_i = a + b_4 z_{4,i} + b_6 z_{6,i} + b_8 z_{8,i} + \epsilon_i \$\$

Here the coefficients \$b_4\$, \$b_6\$ and \$b_8\$ would be the price increase relative to a 2-cylinder car. The dummy \$z_2\$ is left out because it serves as the baseline, so the constant \$a\$ picks up the mean price of a 2-cylinder car.
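
In R, one way to get this specification is to wrap the variable in `factor()`, which under R's default treatment contrasts creates the dummies and drops the first level as the baseline. A minimal sketch, again on invented data (`car_data` and its prices are hypothetical):

```r
# Hypothetical toy data: the prices are made up for illustration only.
car_data <- data.frame(
  cyl   = c(2, 2, 4, 4, 6, 6, 8, 8),
  price = c(10, 12, 15, 17, 18, 20, 35, 37)
)

# factor(cyl) expands into dummies for 4, 6 and 8 cylinders;
# 2 cylinders is the omitted baseline under the default contrasts.
fit_dummy <- lm(price ~ factor(cyl), data = car_data)
print(coef(fit_dummy))

# Group means, for comparison with the coefficients.
group_means <- tapply(car_data$price, car_data$cyl, mean)
print(group_means)
```

The intercept should match the mean price of the 2-cylinder cars, and each remaining coefficient should match the difference between that group's mean and the 2-cylinder mean.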

Or if you run the regression without a constant, you could run:

\$\$ y_i = b_2 z_{2,i} + b_4 z_{4,i} + b_6 z_{6,i} + b_8 z_{8,i} + \epsilon_i \$\$

Here the coefficients (\$b_2\$, \$b_4\$, \$b_6\$, \$b_8\$) would be the mean price for each number of cylinders. Observe that the average price is no longer assumed to be affine in the number of cylinders! You could have a small difference between \$b_4\$ and \$b_6\$ but a large difference between \$b_6\$ and \$b_8\$.
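
The no-constant version can be sketched in R by removing the intercept with `0 +` in the formula. As before, `car_data` and its prices are invented for illustration only.

```r
# Hypothetical toy data: the prices are made up for illustration only.
car_data <- data.frame(
  cyl   = c(2, 2, 4, 4, 6, 6, 8, 8),
  price = c(10, 12, 15, 17, 18, 20, 35, 37)
)

# "0 +" drops the constant, so all four dummies enter the regression.
fit_means <- lm(price ~ 0 + factor(cyl), data = car_data)
print(coef(fit_means))

# The coefficients coincide with the per-group mean prices.
group_means <- tapply(car_data$price, car_data$cyl, mean)
print(group_means)

# Steps between adjacent cylinder counts are free to differ.
print(diff(unname(coef(fit_means))))
```

Unlike the single-slope model, the steps between adjacent coefficients are not forced to be equal. With these invented numbers the step from 6 to 8 cylinders is far larger than the step from 4 to 6.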
