Can anyone point me towards a good explanation of when a residualized variable in a regression will give you the same answer as using a non-residualized variable with controls?
For instance, say I want to know the effect of a variable $x$ on $y$ and need to control for $a$ and $b$. In a classic linear model framework I can either add $a$ and $b$ as covariates (i.e., control variables) to the model of $y$ on $x$, or I can first regress $x$ on $a$ and $b$, and then use the residuals from this regression (the residualized $x$) to predict $y$. Both will give the same coefficient for $x$.
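For reference, the linear-model equivalence is the Frisch-Waugh-Lovell result. A short sketch, in notation that is introduced here for illustration and is not part of the setup above: let $Z$ be the matrix holding an intercept and the controls, let $M = I - Z(Z'Z)^{-1}Z'$ be the matrix that turns a variable into its residuals from a regression on $Z$, and let $\tilde{x} = Mx$ be the residualized $x$. Because $M$ is symmetric and idempotent, $x'My = (Mx)'y = \tilde{x}'y$ and $x'Mx = \tilde{x}'\tilde{x}$, so

$$
\hat\beta_{\text{controls}} = (x'Mx)^{-1}\,x'My = (\tilde{x}'\tilde{x})^{-1}\,\tilde{x}'y = \hat\beta_{\text{resid}} .
$$

Since $Z$ contains an intercept, $\tilde{x}$ has mean zero, so including an intercept in the regression of $y$ on $\tilde{x}$ does not change the slope. Only the point estimate is guaranteed to match: $y$ itself is not residualized, so the residuals, and hence the standard errors, of the two regressions generally differ. The argument relies on OLS being a linear projection, and that step has no direct analogue when the mean of $y$ is a nonlinear function of the linear predictor.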
This works in the linear model case, but does a residualized $x$ give the same coefficient as $x$ with controls in other types of models, e.g., logit or Poisson models? My own simple simulations suggest it does not (see R code below), but I am trying to understand why, and whether residualization can ever be used in place of adding controls outside of the linear model framework. Can anyone point me towards a good explanation?
    # generate the data
    n = 10000
    set.seed(3345)
    a = rnorm(n); b = rnorm(n)
    x = .4*a + .4*b*b + rnorm(n)
    y = .5*x + .3*a + .3*b*b + rnorm(n)

    ## LINEAR MODEL ####
    # a model with controls gets the right coefficient
    summary(lm(y ~ x + a + I(b^2)))
    residmod = lm(x ~ a + I(b^2))
    x.resid = resid(residmod)
    # using a residualized variable gets the same coefficient
    summary(lm(y ~ x.resid))

    ## LOGIT MODEL ####
    y = .5*x + .3*a + .3*b*b + rlogis(n)
    ydichot = ifelse(y > 0, 1, 0)
    # a model with controls gets the right coefficient
    summary(glm(ydichot ~ x + a + I(b^2), family=binomial))
    # using a residualized variable does NOT get the same coefficient
    summary(glm(ydichot ~ x.resid, family=binomial))

    ## POISSON MODEL ####
    mu = exp(.5*x + .3*a + .3*b*b)
    ycount = rpois(n, mu)
    summary(glm(ycount ~ x + a + I(b^2), family=poisson))
    # using a residualized variable does NOT get the same coefficient
    summary(glm(ycount ~ x.resid, family=poisson))
Best Answer
Residualization can be used outside the linear framework. For a direct application of residualization to nonlinear probability models (logit, probit), you could consult this paper, which explains what residualization does in those models: http://smx.sagepub.com/content/42/1/286.
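One common way to see why the logit coefficient in the question changes is the latent-variable view. What follows is a back-of-envelope sketch under stated assumptions, not a summary of the linked paper, and the symbols $\hat{x}$, $\tilde{x}$, $g$ and $c$ are introduced here only for illustration. Split $x$ into the part predicted by the controls and the residual, $x = \hat{x} + \tilde{x}$, and write the latent outcome as

$$
y^* = \beta x + \gamma' c + e = \beta\tilde{x} + g + e, \qquad g = \beta\hat{x} + \gamma' c, \qquad \operatorname{Var}(e) = \pi^2/3 ,
$$

where $c$ collects the controls and $e$ is standard logistic. A logit of the dichotomized outcome on $\tilde{x}$ alone pushes $g$ into the error term. Because the logit model fixes the scale of its error, the fitted coefficient is rescaled, approximately as

$$
b_{\text{resid}} \approx \beta\,\sqrt{\frac{\pi^2/3}{\pi^2/3 + \operatorname{Var}(g)}} ,
$$

assuming $g$ is independent of $\tilde{x}$ and that $g + e$ is still roughly logistic in shape. In the question's simulation $\hat{x} \approx .4a + .4b^2$, so $g \approx .5a + .5b^2$ and $\operatorname{Var}(g) \approx .25 + .25 \cdot 2 = .75$, which puts the factor at roughly $0.9$. I have not checked that figure against the simulation output, so treat it as an indication of direction and rough size only: the residualized coefficient is pulled towards zero even though $\tilde{x}$ is uncorrelated with the controls. In the linear model the same omitted term only inflates the residual variance and leaves the coefficient untouched.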
Residualization is also used in nonparametric regression.