Solved – Does it make sense to use residuals as an independent variable

I recently ran a regression of the following form:

``mod <- lm(log(y) ~ log(x)) ``

To examine how y scales as a function of x. I then examined the top 20 (super-linear scaling with respect to predicted trendline) and bottom 20 residuals (sub-linear scaling with respect to predicted trendline), and noticed that all these observations shared startling regularity in a third variable, `z`. (all the positive residuals have very large values of `z` and all the negative residuals have very small values of `z.`)

I want to be able to demonstrate that this is a meaningful, statistically significant pattern. My intuition was to run

``lm(z ~ resid(mod)) ``

but this strikes me as wrong. Is there a way to capture this pattern using residuals, or is this the wrong way of thinking about it altogether?

Re: Heteroskedastic Standard Errors

Here is a QQ plot:

Residuals vs. Fitted Values

The second graph in particular is somewhat alarming. Should I be worried about heteroskedastic standard errors with a huge dataset (300,000 obs), if I'm using robust SEs

Contents

Well, you want to understand how the residuals could depend on `z`. So the model you should look at would be

``lm( resid(mod) ~ z ) ``

but maybe first (show us) the corresponding plot. But, maybe what you see is that the spread of the residuals depend on `z` not its mean (heteroskedasticity). Then try a model like

``lm( Id(resid(mod)^2) ~ z ) ``

(or replace the square with absolute value.) If this turns out in the confirmative, maybe try a more complex model, a `gamlss` which permits simultaneous estimation of mean and variance functions. You could look through for some examples.

Rate this post