I understand that in a linear regression model, the residual sum of squares will either remain the same or fall with the addition of a new variable.

What if the two models were

$$
I \colon y_i = \beta_0 + \beta_1 x_{1i} + \epsilon_i
$$

and

$$
II \colon y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{1i}^{2} + \epsilon_i
$$

Then, will the residual sum of squares of model 2 be less than or equal to that of model 1?


#### Best Answer

While it is true that your second model is technically also a linear regression model, I don't think that's the salient point here.

The point here is that $M_2(x) = M_1(x) + f(x|\theta)$

in which $M_{1}(x)$ is "Model 1" and $M_{2}(x)$ is "Model 2".

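Provided there exists a $\theta$ such that $f(x|\theta)=0 \ \forall x$ (see comments for a discussion of this), it must be the case that, on your training data, the residual sum of squares for Model 2 is **less than or equal to** that for Model 1. For your two models, $f(x|\theta) = \beta_2 x_{1i}^{2}$ with $\theta = \beta_2$, and the condition holds at $\beta_2 = 0$.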
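Stated in symbols, as a sketch: write $\mathrm{RSS}_1$ and $\mathrm{RSS}_2$ for the minimised residual sums of squares of the two models, and let $\theta_0$ be a parameter value (assumed to exist, as above) with $f(x|\theta_0)=0 \ \forall x$. The notation $\mathrm{RSS}_k$ and $\theta_0$ is introduced here only for illustration. Minimising over a larger set of parameters cannot give a larger minimum, so

$$
\mathrm{RSS}_2 = \min_{\beta_0,\beta_1,\theta} \sum_i \bigl(y_i - \beta_0 - \beta_1 x_{1i} - f(x_{1i}|\theta)\bigr)^2 \le \min_{\beta_0,\beta_1} \sum_i \bigl(y_i - \beta_0 - \beta_1 x_{1i} - f(x_{1i}|\theta_0)\bigr)^2 = \mathrm{RSS}_1
$$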

Why is this the case? Imagine you wish to train Model 2, but you initialise it such that $M_{2}(x)=M_{1}(x)$. That is, you set $\beta_{0}$ and $\beta_{1}$ to the values they take in the fitted $M_{1}(x)$, and you choose $\theta$ s.t. $f(x|\theta)=0 \ \forall x$. You then start gradient descent. One of two things will happen: either gradient descent finds a combination of parameters that decreases the loss (your residual sum of squares) further, or it determines that you are already at a minimum and exits (in which case the residual sums of squares are the same for both models). For models that are linear in their parameters, as both of yours are, the least-squares loss is convex, so that minimum is the global one.

Remember that just because the training loss of your second model is less than or equal to that of Model 1 doesn't mean it's a better model: you could be overfitting, and your test loss could be worse for Model 2.
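The argument can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data (the seed, sample size, and true coefficients are arbitrary choices, not from the question). It fits both models by ordinary least squares rather than gradient descent, and also evaluates Model 2 at the "initialisation" described above, i.e. Model 1's coefficients with $\beta_2 = 0$. The helper `fit_rss` is a hypothetical name used only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
n = 50
x = rng.uniform(-2.0, 2.0, size=n)        # synthetic regressor x_1
y = 1.0 + 0.5 * x + rng.normal(size=n)    # synthetic response (assumed linear truth)


def fit_rss(X, y):
    """Least-squares fit of y on the columns of X; returns (coefficients, RSS)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)


X1 = np.column_stack([np.ones(n), x])          # Model I: intercept, x
X2 = np.column_stack([np.ones(n), x, x ** 2])  # Model II: intercept, x, x^2

beta1, rss1 = fit_rss(X1, y)
beta2, rss2 = fit_rss(X2, y)

# Model II evaluated at Model I's solution: same beta_0 and beta_1, beta_2 = 0.
init = np.append(beta1, 0.0)
rss_init = float(np.sum((y - X2 @ init) ** 2))

print(f"RSS, Model I:                    {rss1:.4f}")
print(f"RSS, Model II at beta_2 = 0:     {rss_init:.4f}")
print(f"RSS, Model II after fitting:     {rss2:.4f}")
```

The first two numbers should agree up to floating-point error, and the third should be no larger than either. This says nothing about out-of-sample performance; comparing the models on held-out data would be a separate check.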