Updated question:
Why do we use RMSE:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\hat{y}_i - y_i\Big)^2}}$$
Why is it not MRSE:
$$MRSE = \frac{1}{n}\sqrt{\sum_{i=1}^{n}{\Big(\hat{y}_i - y_i\Big)^2}}$$
I understand that other methods (e.g., MAE and MAPE) can be used as a metric for error. My question is specifically about why we use RMSE over MRSE.
Original:
Why is the equation for RMSE:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\Big(\hat{y}_i - y_i\Big)^2}}$$
Why is it not:
$$RMSE = \frac{1}{n}\sqrt{\sum_{i=1}^{n}{\Big(\hat{y}_i - y_i\Big)^2}}$$
What is the reason for taking the square root of 1/n?
Best Answer
While Demetri's answer gives a very good derivation of RMSE, it doesn't really explain why not the other measure you suggest. You can get a little more insight by observing that MRSE is not a valid name for your suggested measure. Look closely and the steps are:
- Square the residuals
- Add them up
- Square root
- Divide by the number of samples
A "mean" needs to have the sum and the divide consecutive. So the MRSE would actually be:
$$ MRSE = \frac{1}{n} \sum \sqrt{(\hat{y}_i - y_i)^2} = \frac{1}{n}\sum |\hat{y}_i - y_i| = MAE$$
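The identity above is easy to check numerically. A minimal sketch (the data here is just illustrative random noise):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)         # "true" values
y_hat = y + rng.normal(size=10) # noisy predictions

# The "true" MRSE: mean of the square-rooted squared residuals
mrse = np.mean(np.sqrt((y_hat - y) ** 2))

# MAE: mean of the absolute residuals
mae = np.mean(np.abs(y_hat - y))

# sqrt(x^2) == |x|, so the two coincide for any data
print(np.isclose(mrse, mae))
```

Since $\sqrt{x^2} = |x|$ pointwise, the two expressions agree term by term, not just on average.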
So, RMSE is the square root of a mean; it is simply transformed (by the square root) for convenience. The MAE is itself a mean. What you have created is not a mean: instead of adding things up and dividing by how many there are, you are adding things up, square-rooting, and then dividing by how many there are. In fact, the construct before the $1/n$ is a Euclidean distance: the total distance of the sample from the predicted $y$-vector. As pointed out by Amin's answer, this distance naturally grows like $\sqrt{n}$ with the length of the $y$-vector, so after dividing by $n$ your error will systematically shrink as the sample gets larger.