I understand the intuition that polynomial regression with a higher degree is able to fit the data better, since it decreases the bias while increasing the variance. However, I am unable to prove this mathematically. Can anyone explain why it is true, i.e. why it decreases $\| y - X\beta \|^2$? Thank you!


#### Best Answer

Suppose your matrix $X$ consists of a single column vector $x_1$; that is, a single feature.

In linear regression, $\beta = (X^TX)^{-1}X^Ty = (x_1^Tx_1)^{-1}x_1^Ty$.
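As a quick numerical sanity check of that formula, the sketch below (using numpy, with made-up data) solves the normal equations $x_1^Tx_1\,\beta = x_1^Ty$ for a single feature:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(20, 1))                      # single feature, as a column
y = 3.0 * x1[:, 0] + rng.normal(scale=0.1, size=20)  # y is roughly 3 * x1

# OLS with one column: beta = (x1^T x1)^{-1} x1^T y,
# computed by solving the normal equations rather than inverting explicitly
beta = np.linalg.solve(x1.T @ x1, x1.T @ y)
```

Solving the linear system is preferred over forming the inverse, but both compute the same estimator.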

If you add a new variable $x_2$ of degree 2 (first orthogonalized against $x_1$, which leaves the column space of $X_2$ unchanged), obtaining $X_2$, then $$ y-X_2(X_2^TX_2)^{-1}X_2^Ty = y-x_1(x_1^Tx_1)^{-1}x_1^Ty - x_2(x_2^Tx_2)^{-1}x_2^T\tilde y_1, $$ where $\tilde y_1 = y-x_1(x_1^Tx_1)^{-1}x_1^Ty$ is the residual of the first regression.

For any full-rank matrix $X$, $\pi_X(y) = X(X^TX)^{-1}X^Ty$ is the orthogonal projection of $y$ onto the space spanned by the columns of $X$. By the Pythagorean theorem, $\|y\|_2^2 = \|\pi_X(y)\|_2^2 + \|y - \pi_X(y)\|_2^2$, so in particular $\|y - \pi_X(y)\|_2 \leq \|y\|_2$: projecting out any column can only shrink the residual.
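The norm-shrinking property of an orthogonal projection is easy to verify numerically; this is a minimal sketch with an arbitrary random matrix and vector:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)

# Orthogonal projection of y onto the column space of X:
# pi_X(y) = X (X^T X)^{-1} X^T y
proj = X @ np.linalg.solve(X.T @ X, X.T @ y)
resid = y - proj

# Pythagoras: ||y||^2 = ||pi_X(y)||^2 + ||y - pi_X(y)||^2,
# so neither component can exceed ||y|| in norm
print(np.linalg.norm(proj) <= np.linalg.norm(y))   # True
print(np.linalg.norm(resid) <= np.linalg.norm(y))  # True
```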

This means that $$ \|y-X_2(X_2^TX_2)^{-1}X_2^Ty\|_2^2 \leq \|y-X(X^TX)^{-1}X^Ty\|_2^2. $$ Equality happens only when $x_1(x_1^Tx_1)^{-1}x_1^Tx_2 = x_2$, that is, when the addition of column $x_2$ does not increase the rank of $X$.

More generally, if $X_n$ denotes the matrix resulting from adding polynomial variables up to degree $n$, then $$ y-X_n(X_n^TX_n)^{-1}X_n^Ty = y-X_{n-1}(X_{n-1}^TX_{n-1})^{-1}X_{n-1}^Ty - x_n(x_n^Tx_n)^{-1}x_n^T\tilde y_{n-1}, $$ and the same reasoning applies.
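The conclusion, that the residual sum of squares can only decrease (or stay equal) as the polynomial degree grows, can be checked directly. A minimal sketch, assuming made-up sinusoidal data and using `np.vander` to build the polynomial design matrix $X_n$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=50)

def rss(degree):
    # Design matrix X_n with columns x^0, x^1, ..., x^degree
    X = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

rss_values = [rss(n) for n in range(1, 8)]

# Each added degree enlarges the column space, so RSS never increases
print(all(a >= b - 1e-9 for a, b in zip(rss_values, rss_values[1:])))  # True
```

`np.linalg.lstsq` is used instead of the explicit normal equations because high-degree Vandermonde matrices are badly conditioned.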

If the inverse does not exist, one can replace $(X^TX)^{-1}X^T$ with the Moore-Penrose pseudoinverse, $X^+$.
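To illustrate, here is a small sketch (with a deliberately rank-deficient, made-up matrix) where $X^TX$ is singular but `np.linalg.pinv` still yields the least-squares fit:

```python
import numpy as np

x1 = np.arange(1.0, 6.0)
X = np.column_stack([x1, 2 * x1])  # second column duplicates the first: X^T X is singular
y = x1.copy()                      # y lies exactly in the column space of X

# (X^T X)^{-1} does not exist here, but the pseudoinverse gives the
# minimum-norm least-squares solution X^+ y
beta = np.linalg.pinv(X) @ y
residual = y - X @ beta

print(np.linalg.norm(residual))  # essentially 0: y is fit exactly
```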
