This post refers to a simple (bivariate) linear regression model, $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$. I have always taken the partitioning of the total sum of squares (SSTO) into the sum of squares for error (SSE) and the sum of squares for the model (SSR) on faith, but once I started really thinking about it, I don't understand why it works…
The part I do understand:
$y_i$: An observed value of $y$
$\bar{y}$: The mean of all the observed $y_i$'s
$\hat{y}_i$: The fitted/predicted value of $y$ for a given observation's $x$
$y_i - \hat{y}_i$: Residual/error (if squared and added up for all observations, this is SSE)
$\hat{y}_i - \bar{y}$: How much the model's fitted value differs from the mean (if squared and added up for all observations, this is SSR)
$y_i - \bar{y}$: How much an observed value differs from the mean (if squared and added up for all observations, this is SSTO).
I can understand why, for a single observation, without squaring anything, $(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$. And I can understand why, if you want to add things up over all observations, you have to square them or they'll add up to 0.
The part I don't understand is why $\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$ (i.e. SSTO = SSR + SSE). It seems to me that if you have a situation where $A = B + C$, then $A^2 = B^2 + 2BC + C^2$, not $A^2 = B^2 + C^2$. Why isn't that the case here?
Best Answer
> It seems to me that if you have a situation where $A = B + C$, then $A^2 = B^2 + 2BC + C^2$, not $A^2 = B^2 + C^2$. Why isn't that the case here?
Conceptually, the idea is that $BC = 0$ because $B$ and $C$ are orthogonal (i.e. are perpendicular).
In the context of linear regression here, the residuals $\epsilon_i = y_i - \hat{y}_i$ are orthogonal to the demeaned forecast $\hat{y}_i - \bar{y}$. The forecast from linear regression creates an orthogonal decomposition of $\mathbf{y}$ in a similar sense as $(3,4) = (3,0) + (0,4)$ is an orthogonal decomposition.
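As a quick sanity check of that analogy, squaring the toy decomposition gives exactly the Pythagorean relation, because the cross term involves a zero dot product:

$$\|(3,4)\|^2 = \|(3,0)\|^2 + 2\,(3,0)\cdot(0,4) + \|(0,4)\|^2 = 9 + 0 + 16 = 25.$$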
Linear Algebra version:
Let:
$$\mathbf{z} = \begin{bmatrix} y_1 - \bar{y} \\ y_2 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{bmatrix} \quad\quad \hat{\mathbf{z}} = \begin{bmatrix} \hat{y}_1 - \bar{y} \\ \hat{y}_2 - \bar{y} \\ \vdots \\ \hat{y}_n - \bar{y} \end{bmatrix} \quad\quad \boldsymbol{\epsilon} = \begin{bmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{bmatrix} = \mathbf{z} - \hat{\mathbf{z}}$$
Linear regression (with a constant included) decomposes $\mathbf{z}$ into the sum of two vectors: a forecast $\hat{\mathbf{z}}$ and a residual $\boldsymbol{\epsilon}$:
$$ \mathbf{z} = \hat{\mathbf{z}} + \boldsymbol{\epsilon} $$
Let $\langle \cdot,\cdot \rangle$ denote the dot product. (More generally, $\langle X,Y \rangle$ can be the inner product $E[XY]$.)
\begin{align*} \langle \mathbf{z}, \mathbf{z} \rangle &= \langle \hat{\mathbf{z}} + \boldsymbol{\epsilon}, \hat{\mathbf{z}} + \boldsymbol{\epsilon} \rangle \\ &= \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + 2 \langle \hat{\mathbf{z}}, \boldsymbol{\epsilon} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon} \rangle \\ &= \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon} \rangle \end{align*}
where the last line follows from the fact that $\langle \hat{\mathbf{z}}, \boldsymbol{\epsilon} \rangle = 0$ (i.e. that $\hat{\mathbf{z}}$ and $\boldsymbol{\epsilon} = \mathbf{z} - \hat{\mathbf{z}}$ are orthogonal). You can prove that $\hat{\mathbf{z}}$ and $\boldsymbol{\epsilon}$ are orthogonal based on how ordinary least squares constructs $\hat{\mathbf{z}}$, as sketched below.
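Here is one way to sketch that orthogonality from the normal equations (the matrix notation $X = [\mathbf{1} \;\; \mathbf{x}]$ for the design matrix with a constant column is my addition, not part of the notation above). OLS chooses $\hat{\boldsymbol{\beta}}$ so that

$$X^\top(\mathbf{y} - X\hat{\boldsymbol{\beta}}) = \mathbf{0} \quad\Longrightarrow\quad \sum_i \epsilon_i = 0 \;\text{ and }\; \sum_i x_i \epsilon_i = 0,$$

and therefore

$$\langle \hat{\mathbf{z}}, \boldsymbol{\epsilon} \rangle = \sum_i (\hat{y}_i - \bar{y})\epsilon_i = (\hat{\beta}_0 - \bar{y})\sum_i \epsilon_i + \hat{\beta}_1 \sum_i x_i \epsilon_i = 0.$$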
$\hat{\mathbf{z}}$ is the linear projection of $\mathbf{z}$ onto the subspace defined by the linear span of the regressors $\mathbf{x}_1$, $\mathbf{x}_2$, etc. The residual $\boldsymbol{\epsilon}$ is orthogonal to that entire subspace, hence $\hat{\mathbf{z}}$ (which lies in the span of $\mathbf{x}_1$, $\mathbf{x}_2$, etc.) is orthogonal to $\boldsymbol{\epsilon}$.
Note that as I defined $\langle \cdot,\cdot \rangle$ as the dot product, $\langle \mathbf{z}, \mathbf{z} \rangle = \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon} \rangle$ is simply another way of writing $\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$ (i.e. SSTO = SSR + SSE).
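For readers who want a concrete check, here is a minimal numerical sketch (my own illustration, not from the original answer) that fits an OLS line with numpy and verifies that the summed cross term is zero and that SSTO = SSR + SSE up to floating-point error:

```python
import numpy as np

# Simulated data for a simple (bivariate) regression -- any data would do.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

# Ordinary least squares fit of y = b0 + b1*x via least squares.
X = np.column_stack([np.ones_like(x), x])        # design matrix with a constant
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                             # fitted values
resid = y - y_hat                                # residuals

ssto = np.sum((y - y.mean()) ** 2)               # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)            # sum of squares for the model
sse = np.sum(resid ** 2)                         # sum of squares for error
cross = np.sum((y_hat - y.mean()) * resid)       # the "2BC" cross term, summed

print(f"cross term: {cross:.2e}")                # ~0 (floating-point noise)
print(f"SSTO      = {ssto:.6f}")
print(f"SSR + SSE = {ssr + sse:.6f}")            # matches SSTO
```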