I know that model sum of squares is the ratio of the between-group sum of squares to the model degrees of freedom, and that the between-group sum of squares is a variation between cluster means (and the smaller the variation the closer they are) and that the model has k-1 degrees of freedom where k is the number of clusters. But the size of the groups and the # of clusters have nothing to do with each other, so between group sum of squares and # of clusters have no effect on each other (such as inversely proportional, etc)

And degrees of freedom can't be lower than 0 (1-0 is the lowest use case).

In that case, what would it mean if you had a model sum of squares = .873 ? Why is it an integer if it is supposed to be a ratio? The degrees of freedom would be 2 in this case also, so shouldn't it be (between group sum of squares: 2 )?

#### Best Answer

The *model sum of squares*, also known as the *explained sum of squares*, is the sum of squared deviations of the model's predicted value $hat{y}_i$ from the outcome variable's unconditional mean $bar{y}$:

$$ mathit{ESS} = sum_{i=1}^n (hat{y}_i – bar{y})^2 $$

In linear regression, the total sum of squares equals the explained sum of squares plus the residual sum of squares because the residuals are statistically orthogonal (by construction) to the explanatory variables.

$$ underbrace{sum_{i=1}^n (y_i – bar{y})^2}_{mathit{TSS}} = underbrace{sum_{i=1}^n (hat{y}_i – bar{y})^2}_{mathit{ESS}} + underbrace{sum_{i=1}^n (y_i – hat{y}_i)^2}_{mathit{RSS}}$$

**The residuals' degree of freedom is an entirely different concept.** The residuals degree of freedom is the dimension of the linear subspace in which the residual vector lies.

## Some intuition and motivation for degrees of freedom

Imagine have some vector $ boldsymbol{epsilon} = (x,y,z) in mathbb{R}^3$, that is, $boldsymbol{epsilon}$ is some point in three dimensional space.

### Scenario 1:

Q: You are told $x + y + z = 1$. What's the space of points $(x,y,z)$ that satisfy that constraint?

A: a plane: . A plane is a two-dimensional linear subspace. That restriction implies $boldsymbol{epsilon}$ lies in a 2 dimensional linear subspace of $mathbb{R}^3$.

### Scenario 2:

Q: You are told $x + y + z = 1$ and that $y + z = 0$. What's the space of points $(x,y, z)$ that satisfy those constraints?

A: It's a line. . A line is a one dimensional linear subspace. The restrictions imply $boldsymbol{epsilon}$ lies in a one dimensional linear subspace of $mathbb{R}^3$.

## So what's residuals' degrees of freedom?

In linear regression, your residual vector is an $n$ dimensional vector where $n$ is the number of observations.

$$ boldsymbol{epsilon} = begin{bmatrix} epsilon_1 \ epsilon_2 \ ldots \ epsilon_n end{bmatrix} $$

So the residual vector $boldsymbol{epsilon}$ could take any value in $mathbb{R}^n$? **No!**

For each coefficient you estimate, you impose the constraint that the residual vector is orthogonal to associated right hand side variable. If you're running the regression: $$y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + ldots + b_k x_{i,k} + $$

you have $k+1$ linear constraints. In matrix form, the $k+1$ equations can be written as $X'X mathbf{b} = X'mathbf{y}$. Hence the residual vector is restricted to an $n-k-1$ dimensional linear subspace of $mathbb{R}^n$.

### Similar Posts:

- Solved – R squared and higher order polynomial regression
- Solved – Linear regression: *Why* can you partition sums of squares
- Solved – Degrees of freedom in GLS
- Solved – Why is the R-squared lower in a regression that contains more variables that the sum of R-squared from two difference regression functions
- Solved – Modified K-means with unequal cluster variances