# Solved – Linear regression: *Why* can you partition sums of squares

This post refers to a bivariate linear regression model, \$Y_i = \beta_0 + \beta_1 x_i\$. I have always taken the partitioning of the total sum of squares (SSTO) into the sum of squares for error (SSE) and the sum of squares for the model (SSR) on faith, but once I started really thinking about it, I don't understand why it works…

The part I do understand:

\$y_i\$: An observed value of \$y\$

\$\bar{y}\$: The mean of all observed \$y_i\$s

\$\hat{y}_i\$: The fitted/predicted value of \$y\$ for a given observation's \$x\$

\$y_i - \hat{y}_i\$: Residual/error (if squared and added up over all observations, this is SSE)

\$\hat{y}_i - \bar{y}\$: How much the model's fitted value differs from the mean (if squared and added up over all observations, this is SSR)

\$y_i - \bar{y}\$: How much an observed value differs from the mean (if squared and added up over all observations, this is SSTO)

I can understand why, for a single observation, without squaring anything, \$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\$. And I can understand why, if you want to add things up over all observations, you have to square them or they'll add up to 0.

The part I don't understand is why \$(y_i - \bar{y})^2 = (\hat{y}_i - \bar{y})^2 + (y_i - \hat{y}_i)^2\$ (i.e. SSTO = SSR + SSE when summed over all observations). It seems to me that if you have a situation where \$A = B + C\$, then \$A^2 = B^2 + 2BC + C^2\$, not \$A^2 = B^2 + C^2\$. Why isn't that the case here?
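Written out with the regression quantities (just expanding the square, nothing assumed beyond the identity above), the troublesome cross term is explicit:

\$\$ \sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + 2 \sum_i (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) + \sum_i (y_i - \hat{y}_i)^2, \$\$

so the question is really: why does \$\sum_i (\hat{y}_i - \bar{y})(y_i - \hat{y}_i)\$ equal zero?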


> It seems to me that if you have a situation where \$A = B + C\$, then \$A^2 = B^2 + 2BC + C^2\$, not \$A^2 = B^2 + C^2\$. Why isn't that the case here?

Conceptually, the idea is that \$BC = 0\$ because \$B\$ and \$C\$ are orthogonal (i.e. are perpendicular).

In the context of linear regression here, the residuals \$\epsilon_i = y_i - \hat{y}_i\$ are orthogonal to the demeaned forecast \$\hat{y}_i - \bar{y}\$. The forecast from linear regression creates an orthogonal decomposition of \$\mathbf{y}\$ in a similar sense as \$(3,4) = (3,0) + (0,4)\$ is an orthogonal decomposition.
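To see why orthogonality is what makes the squares add, take that toy decomposition literally: the two pieces are perpendicular, so

\$\$ \|(3,4)\|^2 = 25 = 9 + 16 = \|(3,0)\|^2 + \|(0,4)\|^2, \qquad \text{because } (3,0) \cdot (0,4) = 0. \$\$

The same Pythagorean bookkeeping is what happens with \$\hat{\mathbf{z}}\$ and \$\boldsymbol{\epsilon}\$ below.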

### Linear Algebra version:

Let:

\$\$ \mathbf{z} = \begin{bmatrix} y_1 - \bar{y} \\ y_2 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{bmatrix} \quad \quad \hat{\mathbf{z}} = \begin{bmatrix} \hat{y}_1 - \bar{y} \\ \hat{y}_2 - \bar{y} \\ \vdots \\ \hat{y}_n - \bar{y} \end{bmatrix} \quad \quad \boldsymbol{\epsilon} = \begin{bmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{bmatrix} = \mathbf{z} - \hat{\mathbf{z}} \$\$

Linear regression (with a constant included) decomposes \$\mathbf{z}\$ into the sum of two vectors: a forecast \$\hat{\mathbf{z}}\$ and a residual \$\boldsymbol{\epsilon}\$:

\$\$ \mathbf{z} = \hat{\mathbf{z}} + \boldsymbol{\epsilon} \$\$

Let \$\langle \cdot, \cdot \rangle\$ denote the dot product. (More generally, \$\langle X, Y \rangle\$ can be the inner product \$E[XY]\$.)

\$\$ \begin{align*} \langle \mathbf{z}, \mathbf{z} \rangle &= \langle \hat{\mathbf{z}} + \boldsymbol{\epsilon}, \hat{\mathbf{z}} + \boldsymbol{\epsilon} \rangle \\ &= \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + 2 \langle \hat{\mathbf{z}}, \boldsymbol{\epsilon} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon} \rangle \\ &= \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon} \rangle \end{align*} \$\$

The last line follows from the fact that \$\langle \hat{\mathbf{z}}, \boldsymbol{\epsilon} \rangle = 0\$ (i.e. that \$\hat{\mathbf{z}}\$ and \$\boldsymbol{\epsilon} = \mathbf{z} - \hat{\mathbf{z}}\$ are orthogonal). You can prove \$\hat{\mathbf{z}}\$ and \$\boldsymbol{\epsilon}\$ are orthogonal based upon how ordinary least squares regression constructs \$\hat{\mathbf{z}}\$.

\$\hat{\mathbf{z}}\$ is the linear projection of \$\mathbf{z}\$ onto the subspace defined by the linear span of the regressors \$\mathbf{x}_1\$, \$\mathbf{x}_2\$, etc. The residual \$\boldsymbol{\epsilon}\$ is orthogonal to that entire subspace, hence \$\hat{\mathbf{z}}\$ (which lies in the span of \$\mathbf{x}_1\$, \$\mathbf{x}_2\$, etc.) is orthogonal to \$\boldsymbol{\epsilon}\$.
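To make that concrete (a sketch in symbols, assuming a design matrix \$X\$ whose columns are the regressors, including a column of ones \$\mathbf{1}\$ since a constant is in the regression, and OLS coefficients \$\hat{\boldsymbol{\beta}}\$): the least squares normal equations say the residual is orthogonal to every column of \$X\$,

\$\$ X^\top \boldsymbol{\epsilon} = X^\top \left( \mathbf{y} - X \hat{\boldsymbol{\beta}} \right) = \mathbf{0}. \$\$

In particular \$\mathbf{1}^\top \boldsymbol{\epsilon} = \sum_i \epsilon_i = 0\$. Since \$\hat{\mathbf{z}} = X\hat{\boldsymbol{\beta}} - \bar{y}\,\mathbf{1}\$ is a linear combination of columns of \$X\$, it follows that \$\langle \hat{\mathbf{z}}, \boldsymbol{\epsilon} \rangle = \hat{\mathbf{z}}^\top \boldsymbol{\epsilon} = 0\$.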

Note that as I defined \$\langle \cdot, \cdot \rangle\$ as the dot product, \$\langle \mathbf{z}, \mathbf{z} \rangle = \langle \hat{\mathbf{z}}, \hat{\mathbf{z}} \rangle + \langle \boldsymbol{\epsilon}, \boldsymbol{\epsilon} \rangle\$ is simply another way of writing \$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2\$ (i.e. SSTO = SSR + SSE).
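As a quick numerical sanity check (a minimal sketch using NumPy on made-up data; the synthetic x/y and variable names are just illustrative), you can fit a line by least squares and confirm that the cross term vanishes and the sums of squares add up:

```python
import numpy as np

# Made-up data for illustration: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# OLS with a constant included: design matrix has a column of ones
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat          # fitted values
resid = y - y_hat             # residuals (epsilon)
y_bar = y.mean()

ssto = np.sum((y - y_bar) ** 2)          # total sum of squares
ssr = np.sum((y_hat - y_bar) ** 2)       # regression (model) sum of squares
sse = np.sum(resid ** 2)                 # error sum of squares
cross = np.sum((y_hat - y_bar) * resid)  # the cross term

print(f"cross term ~ {cross:.2e}")       # ~0 up to floating-point error
print(f"SSTO      = {ssto:.6f}")
print(f"SSR + SSE = {ssr + sse:.6f}")    # matches SSTO
```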
