How can you prove that the normal equations $(X^TX)\beta = X^TY$ have one or more solutions without the assumption that $X$ is invertible?

My only guess is that it has something to do with generalized inverse, but I am totally lost.


#### Best Answer

One is tempted to be glib and point out that because the quadratic form

$$\beta \to (Y - X\beta)'(Y - X\beta)$$

is positive semi-definite, there exists a $\beta$ for which it is minimum and that minimum is found (by setting the gradient with respect to $\beta$ to zero) with the normal equations

$$X'(Y - X\beta) = 0,$$

whence **there must be at least one solution regardless of the rank of $X'X$**. However, this argument does not seem to be in the spirit of the question, which appears to be a purely algebraic statement. Perhaps it's of interest to understand why such an equation must have a solution and under precisely what conditions. So let's start over and pretend we don't know the connection with least squares.
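The least-squares argument can be checked numerically. Here is a minimal NumPy sketch (the design matrix and response are made-up illustrative data) with a deliberately rank-deficient $X$: $X'X$ has no inverse, yet any least-squares minimizer still satisfies the normal equations.

```python
import numpy as np

# A deliberately rank-deficient design: the third column duplicates the
# first, so X'X is singular and has no ordinary inverse.
X = np.array([[1., 0., 1.],
              [1., 1., 1.],
              [1., 2., 1.],
              [1., 3., 1.]])
y = np.array([0., 1., 3., 5.])

# lstsq minimizes ||y - X b||^2; any minimizer solves the normal equations.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# X'X beta = X'y holds even though X'X is not invertible.
print(np.allclose(X.T @ X @ beta, X.T @ y))  # True
```

Note that `lstsq` returns just one of infinitely many minimizers here (the minimum-norm one); all of them solve the normal equations.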

It all comes down to the meaning of $X'$, the transpose of $X$. This will turn out to be a matter of a simple definition, appropriate notation, and the concept of a *nondegenerate sesquilinear form.* Recall that $X$ is the "design matrix" of $n$ rows (one for each observation) and $p$ columns (one for each variable, including a constant if any). It therefore represents a linear transformation from the vector space $\mathbb V = \mathbb{R}^p$ to $\mathbb W = \mathbb{R}^n$.

The transpose of $X$, thought of as a linear transformation, is a linear transformation of the *dual spaces* $X': \mathbb{W}^* \to \mathbb{V}^*$. In order to make sense of a composition like $X'X$, then, it is necessary to identify $\mathbb{W}^*$ with $\mathbb{W}$. That's what the usual inner product (sum of squares) on $\mathbb{W}$ does.

There are actually two inner products $g_V$ and $g_W$ defined on $\mathbb V$ and $\mathbb W$ respectively. These are real-valued bilinear symmetric functions that are *non-degenerate*. The latter means that

$$g_W(u, v) = 0\ \forall\, u \in \mathbb W \implies v = 0,$$

with analogous statements for $g_V$. Geometrically, these inner products enable us to measure length and angle. The condition $g(u,v)=0$ can be thought of as $u$ being "perpendicular" to or "orthogonal to" $v$. ("Perpendicular" is a geometric term while "orthogonal" is the more general algebraic term. See Michael J. Wichura (2006), *The Coordinate-Free Approach to Linear Models* at p. 7.)

Nondegeneracy means that only the zero vector is perpendicular to the entire vector space. (This generality means that the results obtained here will apply to the *generalized least squares* setting, for which $g_W$ is not necessarily the usual inner product given as the sum of products of components, but is some arbitrary nondegenerate form. We could dispense with $g_V$ altogether, defining $X': \mathbb W \to \mathbb V^*$, but I expect many readers to be unfamiliar or uncomfortable with dual spaces and so choose to avoid this formulation.)

With these inner products in hand, the transpose of any linear transformation $X: \mathbb V \to \mathbb W$ is defined by $X': \mathbb W \to \mathbb V$ via

$$g_V(X'(w), v) = g_W(w, X(v))$$

for all $w \in \mathbb W$ and $v \in \mathbb V$. That there actually exists a vector $X'(w) \in \mathbb V$ with this property can be established by writing things out with bases for $\mathbb V$ and $\mathbb W$; that this vector is unique follows from the non-degeneracy of the inner products. For if $v_1$ and $v_2$ are two vectors for which $g_V(v_1,v) = g_V(v_2,v)$ for all $v \in \mathbb V$, then (from the linearity in the first component) $g_V(v_1-v_2,v) = 0$ for all $v$, implying $v_1-v_2 = 0$.
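The defining identity can be checked numerically. In coordinates, if the forms are represented by symmetric Gram matrices $G_V$ and $G_W$, the transpose relative to them works out to $G_V^{-1}X^{\mathrm T}G_W$. A minimal NumPy sketch (the matrices `Gv` and `Gw` are arbitrary positive-definite examples standing in for $g_V$ and $g_W$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.standard_normal((n, p))

# Arbitrary nondegenerate (here: positive-definite) inner products on V and W,
# represented by Gram matrices: g_V(a, b) = a' Gv b and g_W(u, w) = u' Gw w.
A = rng.standard_normal((p, p)); Gv = A.T @ A + np.eye(p)
B = rng.standard_normal((n, n)); Gw = B.T @ B + np.eye(n)

# The transpose of X relative to these forms, in coordinates: Gv^{-1} X' Gw.
Xt = np.linalg.solve(Gv, X.T @ Gw)

# Check the defining identity g_V(X'(w), v) = g_W(w, X(v)) on random vectors.
w, v = rng.standard_normal(n), rng.standard_normal(p)
print(np.allclose((Xt @ w) @ Gv @ v, w @ Gw @ (X @ v)))  # True
```

With the standard inner products ($G_V$ and $G_W$ identity matrices) this reduces to the ordinary matrix transpose.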

When $\mathbb U \subset \mathbb W$, write $\mathbb{U}^\perp$ for the set of all vectors perpendicular to every vector in $\mathbb U$. Also as a matter of notation, write $X(\mathbb V)$ for the image of $X$, defined to be the set $\{X(v) \mid v \in \mathbb V\} \subset \mathbb W$. **A fundamental relationship** between $X$ and its transpose $X'$ is

$$X'(w) = 0 \iff w \in X(\mathbb V)^\perp.$$

**That is, $w$ is in the kernel of $X'$ if and only if $w$ is perpendicular to the image of $X$.** This assertion says two things:

If $X'(w) = 0$, then $g_W(w, X(v)) = g_V(X'(w),v) = g_V(0,v) = 0$ for all $v \in \mathbb V$, which merely means $w$ is perpendicular to $X(\mathbb V)$.

If $w$ is perpendicular to $X(\mathbb V)$, that only means $g_W(w, X(v)) = 0$ for all $v \in \mathbb V$, but this is equivalent to $g_V(X'(w), v) = 0$, and nondegeneracy of $g_V$ implies $X'(w) = 0$.
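The fundamental relationship $\ker X' = X(\mathbb V)^\perp$ is easy to see numerically in the standard inner product: the left singular vectors of $X$ belonging to zero singular values span the orthogonal complement of the column space, and $X'$ annihilates each of them. A small NumPy sketch with made-up rank-deficient data:

```python
import numpy as np

# Rank-deficient X: col 3 = col 1 + col 2, so the image is 2-dimensional.
X = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.],
              [2., 1., 3.]])

# Left singular vectors beyond the rank span X(V)^perp = ker(X').
U, s, _ = np.linalg.svd(X)
r = int(np.sum(s > 1e-10))   # rank = dimension of the image X(V)
W_perp = U[:, r:]            # orthonormal basis of the orthogonal complement

# Each such w is killed by X', i.e. is orthogonal to every column of X.
print(np.allclose(X.T @ W_perp, 0))  # True
```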

**We're actually done now.** The analysis has shown that $\mathbb W$ decomposes as a direct sum $\mathbb W = X(\mathbb V) \oplus X(\mathbb V)^\perp$. That is, we can take any *arbitrary* $y \in \mathbb W$ and write it uniquely as $y = y_0 + y^\perp$ with $y_0 \in X(\mathbb V)$ and $y^\perp \in X(\mathbb V)^\perp$. That means $y_0$ is of the form $X(\beta)$ for at least one $\beta \in \mathbb V$. Notice, then, that

$$y - X\beta = (y_0 + y^\perp) - y_0 = y^\perp \in X(\mathbb V)^\perp.$$

The fundamental relationship says this is the same as the left-hand side lying in the kernel of $X'$:

$$X'(y - X\beta) = 0,$$

whence $\beta$ solves the normal equations $X'X\beta = X'y$.

**We are now in a position to give a brief geometric answer to the question** (along with some revealing comments): the normal equations have a solution because any $n$-vector $y \in \mathbb W$ decomposes (uniquely) as the sum of a vector $y_0$ in the range of $X$ and another vector $y^\perp$ perpendicular to $y_0$, and $y_0$ is the image of at least one $p$-vector $\beta \in \mathbb V$. The dimension of the image $X(\mathbb V)$ (its *rank*) is the dimension of the *identifiable* parameters. The dimension of the kernel of $X$ counts the nontrivial linear relations among the parameters. All parameters are identifiable when $X$ is a one-to-one map from $\mathbb V$ to its image in $\mathbb W$.

It is ultimately useful to dispense with the space $\mathbb V$ altogether and work entirely with the subspace $\mathbb U = X(\mathbb V) \subset \mathbb W$, the "column space" of the matrix $X$. The normal equations amount to orthogonal projection onto $\mathbb U$. That frees us conceptually from being tied to any particular parameterization of the model and shows that least-squares models have an *intrinsic dimension* independent of how they happen to be parameterized.
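In coordinates, that projection can be realized with the pseudoinverse. A minimal NumPy sketch (illustrative data, rank-deficient design) showing that the projector depends only on the column space and that the residual automatically satisfies the normal equations:

```python
import numpy as np

# Rank 2 design: col 3 = col 1 + col 2.
X = np.array([[1., 0., 1.],
              [1., 1., 2.],
              [1., 2., 3.],
              [1., 3., 4.]])
y = np.array([1., 2., 2., 4.])

# Orthogonal projector onto the column space U = X(V); it depends only
# on U, not on how the model is parameterized.
P = X @ np.linalg.pinv(X)
y0 = P @ y         # component of y in U
y_perp = y - y0    # component in U^perp

print(np.allclose(P @ P, P))         # idempotent: P is a projection
print(np.allclose(X.T @ y_perp, 0))  # y_perp is in ker(X'): normal equations hold
```

Reparameterizing the model (e.g., dropping the redundant column) changes $X$ but leaves the projector $P$, and hence the fitted values $y_0$, unchanged.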

One interesting outcome of this abstract algebraic demonstration is that *we can solve the normal equations in arbitrary vector spaces.* The result holds, say, for complex spaces, for spaces over finite fields (where minimizing a sum of squares makes little sense), and even over infinite-dimensional spaces that support suitable sesquilinear forms.
