# Why use MSE instead of SSE as cost function in linear regression

I am studying linear regression and have solved some problems analytically.
For that I used the standard, intuitive sum-of-squared-errors function.

Looking at this function, it makes sense why it looks the way it does: we square the errors so that positive and negative errors do not cancel each other out, and squaring gives a convex objective with exactly one minimum, which makes the minimization problem solvable.
We can also use gradient descent.

Looking at other cost functions, like the mean squared error, I do not understand why it is a good idea to take the mean of the squared errors. What motivates us to do this?

It is just an optimization problem; we want the best fit, so I would never have thought to take the mean of the errors, and I do not see why it makes the optimization problem better.

Is it just to work with smaller error values, so that gradient descent converges faster?

It would be great if someone could help me with the intuition and the mathematical motivation.


It is a fact from calculus that a function $$f(x)$$ and $$cf(x)$$ have the same argmin (the $$x$$ that minimizes the function) for any constant $$c > 0$$. It follows that the following all have the same argmin and thus give the same parameter estimates.

$$\sum(y_i - \hat y_i)^2 \\ \dfrac{\sum(y_i - \hat y_i)^2}{n} \\ \dfrac{\sum(y_i - \hat y_i)^2}{n-p} \\ \dfrac{\sum(y_i - \hat y_i)^2}{8}$$

The first is the usual sum of squared errors. The next two are variants of mean squared error (the $$n-p$$ denominator has to do with getting an unbiased estimate of the variance of the error term). I made up the final one.

However, all of these give the same parameter estimates (barring numerical issues coming from doing math on a computer).
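A quick way to convince yourself of this is to minimize the differently scaled objectives numerically and compare the resulting coefficients. The sketch below (using simulated data and `scipy.optimize.minimize`; the variable names are my own) fits the same simple regression by minimizing SSE and MSE:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept

def sse(beta):
    """Sum of squared errors."""
    r = y - X @ beta
    return r @ r

def mse(beta):
    """Mean squared error: the same objective divided by n."""
    return sse(beta) / n

beta_sse = minimize(sse, x0=np.zeros(2)).x
beta_mse = minimize(mse, x0=np.zeros(2)).x

# Both objectives are minimized by the same coefficients
print(np.allclose(beta_sse, beta_mse, atol=1e-4))
```

The objective values at the minimum differ by a factor of $$n$$, but the location of the minimum (the parameter estimates) is identical.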

Mean squared error has the advantage of giving some sense of how much predictions and true values differ (though this is not perfect, since it is on the squared scale rather than the scale of the raw errors), and it has a relationship to the variance of the error term. Further, the value does not grow arbitrarily large simply because you have many observations.
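That last point is easy to demonstrate: under the model above, SSE grows roughly like $$n\sigma^2$$ while MSE stays near the noise variance $$\sigma^2$$ regardless of sample size. A small sketch (simulated data; the closed-form fit via `np.linalg.lstsq` is my choice, not anything specific to the question):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5  # true noise standard deviation, so error variance is 0.25

results = {}
for n in (100, 10_000):
    x = rng.normal(size=n)
    y = 2.0 + 3.0 * x + rng.normal(scale=sigma, size=n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
    resid = y - X @ beta
    sse = resid @ resid      # grows with n
    mse = sse / n            # stays near sigma**2 = 0.25
    results[n] = (sse, mse)
    print(f"n={n}: SSE={sse:.1f}, MSE={mse:.3f}")
```

SSE for the larger sample is roughly 100 times bigger, while MSE hovers near 0.25 in both cases, which is why MSE (unlike SSE) is comparable across datasets of different sizes.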
