# Solved – Are mathematically-coupled variables spurious predictors in linear regressions

The correlation between two ratios with the same denominator is spurious. Similarly, the correlation between two mathematically-coupled variables could also be spurious. Are mathematically-coupled variables (e.g. age and income are both used to derive two formulas, which are then used as Y and X in a linear regression analysis) spurious predictors in linear regressions?


In the case of a regression equation in which the independent and dependent variables share a common component, you can frequently rewrite the equation to show that the independent variable containing the common component is correlated with the error term in the regression equation. For an example besides ratios, I give one in another answer on the site about including the baseline as a control variable when the dependent variable is the change score from baseline (hence the dependent variable is "mathematically coupled" with an independent variable). That example is much more innocuous, though, because of the exact linear relationships between the variables, whereas ratios of variables can have more negative consequences for interpretation (and for other estimated parameters) because they cannot be re-expressed as different linear combinations.
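The change-score case can be illustrated with a small simulation. This is only a sketch under assumed conditions (the variable names, sample size, and the choice of independent standard-normal scores are illustrative, not taken from the answer referred to above): even when follow-up is unrelated to baseline, regressing the change score on baseline gives a slope near -1, because the baseline sits on both sides of the equation.

```python
import numpy as np

# Assumed setup: baseline and follow-up scores are independent draws,
# so any association between baseline and change is induced purely by
# the subtraction that defines the change score.
rng = np.random.default_rng(0)
n = 10_000
baseline = rng.normal(size=n)
followup = rng.normal(size=n)
change = followup - baseline

# np.polyfit with degree 1 returns [slope, intercept].
slope_followup = np.polyfit(baseline, followup, 1)[0]
slope_change = np.polyfit(baseline, change, 1)[0]
r_change = np.corrcoef(baseline, change)[0, 1]

print(f"slope of follow-up on baseline: {slope_followup:.3f}")
print(f"slope of change on baseline:    {slope_change:.3f}")
print(f"corr(baseline, change):         {r_change:.3f}")
```

The two slopes differ by exactly 1, which is the "exact linear relationship" that makes this case easy to reason about: the change-score regression is just the follow-up regression with the baseline coefficient shifted.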

So, it is best to avoid the case where the independent and dependent variables are "pre-processed" in a way that induces some type of dependency between them (this seems to be what Pearson initially meant by the term "spurious" correlation; Aldrich, 1995). In terms of ratios, Kronmal (1993) even suggests that ratio variables should always be avoided (even if only a single independent variable is a ratio), as the functional form of the relationship is more restricted than when the two variables (and their interaction) are included in the regression equation. This still sometimes leaves room for theoretical considerations to guide whether to use ratio variables, but in many observational studies in the social sciences it is more reasonable to avoid ratio variables than to assume the more specific functional form implied by the ratio of the two variables (Firebaugh, 1985).
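As a rough illustration of the ratio case, the following sketch simulates three mutually independent variables and compares the correlation of the raw variables with the correlation of the two ratios that share a denominator. The uniform(1, 2) distributions, the seed, and the names `x`, `y`, `z` are assumptions chosen only to keep the ratios well behaved.

```python
import numpy as np

# Assumed setup: x, y, z are mutually independent and strictly positive.
rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(1.0, 2.0, size=n)
y = rng.uniform(1.0, 2.0, size=n)
z = rng.uniform(1.0, 2.0, size=n)  # common denominator

# Correlation of the raw variables: close to zero by construction.
r_raw = np.corrcoef(x, y)[0, 1]

# Correlation of the ratios: clearly positive, although x and y are
# unrelated, because both ratios move with 1/z.
r_ratio = np.corrcoef(x / z, y / z)[0, 1]

print(f"corr(x, y)     = {r_raw:.3f}")
print(f"corr(x/z, y/z) = {r_ratio:.3f}")
```

Under these assumptions the ratio correlation comes out at roughly 0.5, and all of it is attributable to the shared denominator.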

I don't see why these arguments wouldn't apply to any type of "mathematically coupled" relationship, and hence I suspect it is much easier to interpret the original components separately than it is to interpret them combined. A similar line of thinking, with other illustrative examples, is given in a related question on the site, "Including the interaction but not the main effects in a model". In that thread, though, whuber and wolfgang both give examples that counter this argument.

Just for further illustration, I will give a recent example from some of my own work. I was working on a project that included a multi-wave panel survey with several Likert scales measured at each wave. One theoretical model my co-author proposed was that a specific outcome at Wave 2 depended on both the baseline Likert scale score at wave a and the change in the Likert scale score from wave a to wave b. This can be represented by the model:

$$Y = \alpha + \beta_1 L_a + \beta_2 (L_b - L_a) + \epsilon$$

where $L_b$ is the Likert score at wave b and $L_a$ is the Likert score at wave a. This equation is difficult to interpret, however, because the model can be equivalently written as:

$$Y = \alpha + (\beta_1 - \beta_2) L_a + \beta_2 L_b + \epsilon$$

Again this is innocuous as it doesn't affect the estimation of other parameters in the model, but the use of variables that are just re-expressions of one another brings with it difficulty in interpretation.
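The equivalence of the two parameterizations can be checked numerically. The sketch below uses simulated data (the sample size, true coefficients, and noise levels are made-up assumptions, not values from the project described above) and fits both versions by ordinary least squares with `numpy.linalg.lstsq`.

```python
import numpy as np

# Simulated stand-ins for the Likert scores at waves a and b (assumed values).
rng = np.random.default_rng(2)
n = 500
L_a = rng.normal(3.0, 1.0, size=n)
L_b = L_a + rng.normal(0.0, 1.0, size=n)
y = 1.0 + 0.5 * L_a + 0.8 * (L_b - L_a) + rng.normal(0.0, 1.0, size=n)

# Parameterization 1: baseline and change score.
X_change = np.column_stack([np.ones(n), L_a, L_b - L_a])
# Parameterization 2: the two wave scores entered separately.
X_levels = np.column_stack([np.ones(n), L_a, L_b])

coef_change, *_ = np.linalg.lstsq(X_change, y, rcond=None)
coef_levels, *_ = np.linalg.lstsq(X_levels, y, rcond=None)

fitted_change = X_change @ coef_change
fitted_levels = X_levels @ coef_levels

# Coefficient order in both fits: intercept, L_a term, second term.
print("baseline + change:", np.round(coef_change, 3))
print("separate levels:  ", np.round(coef_levels, 3))
print("identical fitted values:", np.allclose(fitted_change, fitted_levels))
```

Both fits give the same fitted values and the same intercept; the coefficient on $L_a$ in the second fit equals $\beta_1 - \beta_2$ from the first, and the coefficient on $L_b$ equals $\beta_2$. The data cannot distinguish the two models, so only the interpretation changes.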
