Interaction terms are sometimes added to linear regression models when the effect of one variable depends on the value of another variable. But will the inclusion of such interaction terms increase the model's predictive power? Or is the only effect to yield a model that can be better interpreted?
Or put another way, if I only care about the model's performance, and otherwise treat it as a black box, do I need to think about interaction terms?
Best Answer
One way to think about fitting a regression model is that we start with a set of possible functions relating the input variables to the output. This set is called the hypothesis space, and it contains functions corresponding to all possible choices of parameters. We then use the data to choose a function from this set (e.g. by minimizing a loss function like the squared error).
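As a minimal illustration of "choosing a function by minimizing a loss like the squared error" (not part of the original answer; it uses numpy and made-up data), here is an ordinary least squares fit written out directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: two inputs, a linear relationship, plus noise.
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Add a column of ones for the intercept, then pick the coefficients that
# minimize the squared error over the hypothesis space of linear functions.
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # roughly [1.0, 2.0, -3.0]
```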
For linear regression, we have the hypothesis space $\mathcal{F}_1$, the set of all functions whose output is a linear combination of the input variables. Including interaction terms gives the hypothesis space $\mathcal{F}_2$, the set of all functions whose output is a linear combination of the input variables and their interaction terms. Note that $\mathcal{F}_1$ is a subset of $\mathcal{F}_2$. That is, every function in $\mathcal{F}_1$ is also in $\mathcal{F}_2$ (because we can always set the coefficients for the interaction terms to zero), but $\mathcal{F}_2$ contains functions that are not in $\mathcal{F}_1$. This means that including interaction terms gives us the possibility of fitting a wider variety of functions. In particular, $\mathcal{F}_2$ contains functions that are nonlinear with respect to the original input variables.
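To make $\mathcal{F}_2$ concrete, here is a small sketch (an assumption of mine, using scikit-learn's `PolynomialFeatures`, which is not mentioned in the original answer) of building a design matrix with pairwise interaction columns added to the original inputs:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# interaction_only=True adds products of distinct inputs (x0*x1) but no squares.
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X2 = interactions.fit_transform(X)

print(interactions.get_feature_names_out())  # ['x0', 'x1', 'x0 x1']
print(X2)  # the original columns plus the x0*x1 column
```

A linear model fit on `X` corresponds to $\mathcal{F}_1$; the same linear model fit on `X2` corresponds to $\mathcal{F}_2$.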
When fit to a particular dataset, a model that includes interaction terms must fit the training data at least as well as a model that does not. This follows from the fact that $\mathcal{F}_1$ is a subset of $\mathcal{F}_2$: the best-fitting function in $\mathcal{F}_1$ is also available in $\mathcal{F}_2$, so minimizing the training error over $\mathcal{F}_2$ can never give a worse fit. But whether this yields an increase in predictive power depends on the problem. Including interaction terms can increase predictive power in some settings and decrease it in others.
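One quick way to see the "at least as good on the training data" point is to compare training $R^2$ for the two models on the same simulated dataset; a sketch under the same scikit-learn assumption as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
# Simulated data whose true function includes an interaction term.
y = X[:, 0] + X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=200)

# Model in F1: linear terms only.
f1 = LinearRegression().fit(X, y)

# Model in F2: linear terms plus the interaction column.
f2 = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(X, y)

# Training R^2 of f2 is never worse than that of f1 on the same data.
print(f1.score(X, y), f2.score(X, y))
```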
Predictive power is a measure of how well we can predict unseen data drawn from the same distribution as the data used to fit the model. On one hand, including interaction terms lets us fit a wider variety of functions, as above. So, if the true underlying function includes these terms (or is closer to a function that does), then we have the potential to find it (or better approximate it). On the other hand, using a larger hypothesis space means that we're better able to fit noise in the training data, so the risk of overfitting is greater. For a more technical description, see the bias-variance tradeoff: including interaction terms decreases the bias but increases the variance. Predictive power is determined by the combination of these two opposing effects. Whether including interaction terms helps or hurts predictive power therefore depends on the true underlying function, the noise level, and the amount of data available.
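Whether the extra flexibility pays off for a given dataset is an empirical question, and cross-validation is one common way to check it. A hedged sketch (scikit-learn again, with a made-up dataset whose true function happens to have no interactions, so the extra terms mostly fit noise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
# True function is purely additive; the data are small and noisy.
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=2.0, size=60)

f1 = LinearRegression()
f2 = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)

# Mean out-of-sample R^2 over 5 folds. On data like this the interaction
# model often scores no better, and may score worse, because its extra
# coefficients only add variance.
print(cross_val_score(f1, X, y, cv=5).mean())
print(cross_val_score(f2, X, y, cv=5).mean())
```

With more data, or with a true function that actually contains interactions, the comparison can easily go the other way, which is exactly the point of the answer above.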