I want to include time spent doing something (weeks breastfeeding, for example) as an independent variable in a linear model. However, some observations do not engage in the behavior at all. Coding them as 0 isn't really right, because 0 is qualitatively different from any value >0 (i.e. women who don't breastfeed may be very different from women who do, even those who don't do it for very long). The best I can come up with is a set of dummies that categorizes the time spent, but this is a waste of precious information. Something like zero-inflated Poisson also seems like a possibility, but I can't exactly figure out what that would look like in this context. Does anyone have any suggestions?
Best Answer
To expand a bit on the answer of @ken-butler: by adding both the continuous variable (weeks breastfeeding) and an indicator variable for a special value (weeks = 0, i.e. non-breastfeeding), you assume that there is a linear effect for the "non-special" values and a discrete jump in the predicted outcome at the special value. It helps (for me at least) to look at a graph. In the example below we model hourly wage as a function of the hours per week that the respondents (all female) work, and we think that there is something special about "the standard" 40 hours per week:
The code that produced this graph (in Stata) can be found here: http://www.stata.com/statalist/archive/2013-03/msg00088.html
So in this case we have assigned the continuous variable the value 40 even though we wanted it to be treated differently from the other values. Similarly, you would give your weeks breastfeeding the value 0 even though you think it is qualitatively different from the other values. I interpret your comment below as saying that you think this is a problem. It is not, and you do not need to add an interaction term. In fact, that interaction term would be dropped due to perfect collinearity if you tried to add it. This is not a limitation; it just tells you that the interaction term does not add any new information.
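A minimal sketch of why the interaction term carries no new information, using made-up data (the `weeks` values below are hypothetical, and it is assumed that a value of 0 always means "never breastfed"):

```python
# Hypothetical data: weeks of breastfeeding, where 0 means "never breastfed".
weeks = [0, 0, 0, 4, 12, 26, 40]

# Indicator: 1 for non-breastfeeders, 0 otherwise, plus its complement.
non_breastfeeding = [1 if w == 0 else 0 for w in weeks]
breastfeeding = [1 - d for d in non_breastfeeding]

# Interaction with the non-breastfeeding indicator: weeks is 0 whenever the
# indicator is 1, so the product is 0 in every row.
interaction_non_bf = [w * d for w, d in zip(weeks, non_breastfeeding)]

# Interaction with the complementary indicator: an exact copy of weeks.
interaction_bf = [w * d for w, d in zip(weeks, breastfeeding)]

print(interaction_non_bf)       # [0, 0, 0, 0, 0, 0, 0]
print(interaction_bf == weeks)  # True
```

Either way of coding the interaction gives a column that is constant at zero or identical to a column already in the model, which is why regression software would drop it.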
Say your regression equation looks like this:
$$ \hat{y} = \beta_1 weeks\_breastfeeding + \beta_2 non\_breastfeeding + \cdots $$
Where $weeks\_breastfeeding$ is the number of weeks breastfeeding (including the value 0 for those that do not breastfeed) and $non\_breastfeeding$ is an indicator variable that is 1 when someone does not breastfeed and 0 otherwise.
Consider what happens when someone is breastfeeding. The regression equation simplifies to:
$$ \begin{aligned} \hat{y} &= \beta_1 weeks\_breastfeeding + \beta_2 \cdot 0 + \cdots \\ &= \beta_1 weeks\_breastfeeding + \cdots \end{aligned} $$
So $\beta_1$ is just the linear effect of the number of weeks breastfeeding for those that do breastfeed.
Consider what happens when someone is not breastfeeding:
$$ \begin{aligned} \hat{y} &= \beta_1 \cdot 0 + \beta_2 \cdot 1 + \cdots \\ &= \beta_2 + \cdots \end{aligned} $$
So $\beta_2$ gives you the effect of not breastfeeding, and the number of weeks breastfeeding drops from the equation.
You can see that there is no use in adding an interaction term, as that interaction term is already (implicitly) in there.
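The two cases can be checked numerically. The sketch below uses hypothetical, noise-free data and assumed values (an intercept of 10, a slope of 0.5 per week among breastfeeders, and an outcome of 7 for non-breastfeeders), none of which come from the question. It fits the model by ordinary least squares with NumPy; the intercept is written out explicitly, whereas the equation above leaves it in the "$\cdots$".

```python
import numpy as np

# Hypothetical data: 3 non-breastfeeders (coded 0 weeks) and 5 breastfeeders.
weeks = np.array([0, 0, 0, 4, 8, 12, 26, 40], dtype=float)
non_bf = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

# Assumed outcome: linear in weeks for breastfeeders, a separate level for
# non-breastfeeders. No noise, so the fit recovers the values exactly.
y = np.where(non_bf == 1, 7.0, 10.0 + 0.5 * weeks)

# Design matrix: intercept, weeks, indicator for not breastfeeding.
X = np.column_stack([np.ones_like(weeks), weeks, non_bf])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

print(f"intercept = {b0:.2f}, beta_1 = {b1:.2f}, beta_2 = {b2:.2f}")
```

Here `beta_1` comes out as the slope among breastfeeders (0.5), and `beta_2` as the gap between non-breastfeeders and the line extrapolated to 0 weeks (7 - 10 = -3).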
There is, however, something weird about $\beta_2$, as it measures the effect of not breastfeeding by comparing the expected outcome of those who do not breastfeed with that of those who breastfeed but do so for only 0 weeks… It kind of makes sense in a "compare like with like" sort of way, but the practical usefulness is not immediately obvious. It may make more sense to compare the "non-breastfeeders" with those women who breastfed for 12 weeks (approx. 3 months). In that case you just give the "non-breastfeeders" the value 12 for $weeks\_breastfeeding$. So the value you assign to $weeks\_breastfeeding$ for the "non-breastfeeders" does influence the regression coefficient $\beta_2$, in the sense that it determines with whom the "non-breastfeeders" are compared. Instead of a problem, this is actually something that can be quite useful.
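As a rough illustration of this recoding, the sketch below fits the same model twice on hypothetical, noise-free data (the numbers and the helper `fit` are made up for illustration): once with the non-breastfeeders coded as 0 weeks and once with them coded as 12 weeks.

```python
import numpy as np

# Hypothetical data: 3 non-breastfeeders and 5 breastfeeders.
weeks = np.array([0, 0, 0, 4, 8, 12, 26, 40], dtype=float)
non_bf = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

# Assumed outcome: 10 + 0.5 * weeks for breastfeeders, 7 for non-breastfeeders.
y = np.where(non_bf == 1, 7.0, 10.0 + 0.5 * weeks)


def fit(reference_weeks):
    """OLS of y on intercept, weeks and the indicator, with the
    non-breastfeeders' weeks set to reference_weeks."""
    w = np.where(non_bf == 1, reference_weeks, weeks)
    X = np.column_stack([np.ones_like(w), w, non_bf])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


coef_0 = fit(0.0)    # non-breastfeeders compared with "0 weeks"
coef_12 = fit(12.0)  # non-breastfeeders compared with "12 weeks"

print("coded as 0: ", np.round(coef_0, 2))
print("coded as 12:", np.round(coef_12, 2))
```

In this sketch the intercept and the slope are the same in both fits; only the indicator's coefficient changes, from -3 (7 versus the predicted 10 at 0 weeks) to -9 (7 versus the predicted 16 at 12 weeks). In general the shift equals the reference value times the slope.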