How can you prove that the normal equations $(X^TX)\beta = X^TY$ have one or more solutions without the assumption that $X$ is invertible?
My only guess is that it has something to do with a generalized inverse, but I am totally lost.
Best Answer
One is tempted to be glib and point out that because the quadratic form
$$\beta \to (Y - X\beta)'(Y - X\beta)$$
is positive semi-definite, there exists a $\beta$ at which it attains its minimum, and that minimum is found (by setting the gradient with respect to $\beta$ to zero) with the normal equations
$$X'(Y - X\beta) = 0,$$
whence there must be at least one solution regardless of the rank of $X'X$. However, this argument does not seem to be in the spirit of the question, which appears to be a purely algebraic statement. Perhaps it's of interest to understand why such an equation must have a solution and under precisely what conditions. So let's start over and pretend we don't know the connection with least squares.
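(Before doing so, here is a quick numerical sanity check of the glib claim. The sketch below is illustrative, not part of the argument: it uses numpy with made-up names, builds a design matrix whose last column is a linear combination of the others so that $X'X$ is singular, and verifies that the normal equations are nonetheless consistent. A generalized inverse, here the Moore-Penrose pseudoinverse, supplies one particular solution, which connects with the question's guess.)

```python
# Illustrative sketch (assumed names, numpy only): the normal equations stay consistent
# even when X'X is singular, and a generalized inverse yields one solution.
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=(n, 3))
X = np.column_stack([X, X[:, 0] + X[:, 1]])   # last column is redundant, so rank(X) = 3 < 4
y = rng.normal(size=n)

XtX, Xty = X.T @ X, X.T @ y
print(np.linalg.matrix_rank(XtX))             # 3: X'X is singular, hence not invertible

beta_hat = np.linalg.pinv(XtX) @ Xty          # minimum-norm solution via the pseudoinverse
print(np.allclose(XtX @ beta_hat, Xty))       # True: the normal equations are satisfied
```

Any other solution differs from this one by an element of the kernel of $X$, a point the geometric argument below makes precise.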
It all comes down to the meaning of $X'$, the transpose of $X$. This will turn out to be a matter of a simple definition, appropriate notation, and the concept of a nondegenerate sesquilinear form. Recall that $X$ is the "design matrix" of $n$ rows (one for each observation) and $p$ columns (one for each variable, including a constant if any). It therefore represents a linear transformation from the vector space $\mathbb V = \mathbb{R}^p$ to $\mathbb W = \mathbb{R}^n$.
The transpose of $X$, thought of as a linear transformation, is a linear transformation of the dual spaces $X': \mathbb{W}^* \to \mathbb{V}^*$. In order to make sense of a composition like $X'X$, then, it is necessary to identify $\mathbb{W}^*$ with $\mathbb{W}$. That's what the usual inner product (sum of squares) on $\mathbb{W}$ does.
There are actually two inner products $g_V$ and $g_W$ defined on $\mathbb V$ and $\mathbb W$ respectively. These are real-valued bilinear symmetric functions that are non-degenerate. The latter means that
$$g_W(u, v) = 0 \ \forall\, u\in \mathbb W \implies v = 0,$$
with analogous statements for $g_V$. Geometrically, these inner products enable us to measure length and angle. The condition $g(u,v)=0$ can be thought of as $u$ being "perpendicular" to or "orthogonal to" $v$. ("Perpendicular" is a geometric term while "orthogonal" is the more general algebraic term. See Michael J. Wichura (2006), The Coordinate-Free Approach to Linear Models at p. 7.)
Nondegeneracy means that only the zero vector is perpendicular to the entire vector space. (This generality means that the results obtained here will apply to the generalized least squares setting, for which $g_W$ is not necessarily the usual inner product given as the sum of products of components, but is some arbitrary nondegenerate form. We could dispense with $g_V$ altogether, defining $X':\mathbb W\to\mathbb V^*$, but I expect many readers to be unfamiliar or uncomfortable with dual spaces and so choose to avoid this formulation.)
With these inner products in hand, the transpose of any linear transformation $X: \mathbb V \to \mathbb W$ is defined by $X': \mathbb W \to \mathbb V$ via
$$g_V(X'(w), v) = g_W(w, X(v))$$
for all $w\in \mathbb W$ and $v\in \mathbb V$. That there actually exists a vector $X'(w) \in \mathbb V$ with this property can be established by writing things out with bases for $\mathbb V$ and $\mathbb W$; that this vector is unique follows from the non-degeneracy of the inner products. For if $v_1$ and $v_2$ are two vectors for which $g_V(v_1,v)=g_V(v_2,v)$ for all $v\in\mathbb V$, then (from the linearity in the first component) $g_V(v_1-v_2,v)=0$ for all $v$, implying $v_1-v_2=0$.
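In coordinates, if we assume the inner products are represented by positive-definite Gram matrices $G_V$ and $G_W$ (so $g_V(a,b) = a^\mathsf{T} G_V b$, and likewise for $g_W$), this definition works out to $X' = G_V^{-1} X^\mathsf{T} G_W$, which reduces to the ordinary matrix transpose when both Gram matrices are identities; this also covers the generalized least squares forms mentioned above. A brief numerical check of the defining identity, with illustrative names:

```python
# Sketch (illustrative, with assumed Gram matrices) of the transpose defined by
# g_V(X'(w), v) = g_W(w, X(v)). With g_V(a,b) = a' G_V b and g_W(a,b) = a' G_W b,
# the coordinate formula is X' = G_V^{-1} X^T G_W.
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = rng.normal(size=(n, p))

def gram(k, rng):
    A = rng.normal(size=(k, k))
    return A @ A.T + k * np.eye(k)            # positive definite, hence nondegenerate

G_V, G_W = gram(p, rng), gram(n, rng)
X_t = np.linalg.solve(G_V, X.T @ G_W)         # the transpose relative to g_V and g_W

w, v = rng.normal(size=n), rng.normal(size=p)
lhs = (X_t @ w) @ G_V @ v                     # g_V(X'(w), v)
rhs = w @ G_W @ (X @ v)                       # g_W(w, X(v))
print(np.allclose(lhs, rhs))                  # True
```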
When $\mathbb U \subset \mathbb W$, write $\mathbb{U}^\perp$ for the set of all vectors perpendicular to every vector in $\mathbb U$. Also as a matter of notation, write $X(\mathbb V)$ for the image of $X$, defined to be the set $\{X(v) \mid v \in \mathbb V\} \subset \mathbb W$. A fundamental relationship between $X$ and its transpose $X'$ is
$$X'(w) = 0 \iff w \in X(\mathbb V)^\perp.$$
That is, $w$ is in the kernel of $X'$ if and only if $w$ is perpendicular to the image of $X$. This assertion says two things:
- If $X'(w) = 0$, then $g_W(w, X(v)) = g_V(X'(w),v) = g_V(0,v)=0$ for all $v\in\mathbb V$, which merely means $w$ is perpendicular to $X(\mathbb V)$.
- If $w$ is perpendicular to $X(\mathbb V)$, that only means $g_W(w, X(v)) = 0$ for all $v\in\mathbb V$; but this is equivalent to $g_V(X'(w), v) = 0$, and nondegeneracy of $g_V$ implies $X'(w)=0$.
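For the usual inner product this is the familiar fact that the null space of $X^\mathsf{T}$ is the orthogonal complement of the column space of $X$. A small numerical illustration of both directions (names are illustrative):

```python
# Sketch: w is in ker(X') exactly when w is perpendicular to im(X) (usual inner product).
import numpy as np

rng = np.random.default_rng(2)
n, p = 7, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Direction 1: a vector perpendicular to im(X) -- here a least-squares residual -- is killed by X'.
P = X @ np.linalg.pinv(X)                     # orthogonal projector onto im(X)
w_perp = y - P @ y                            # perpendicular to every column of X
print(np.allclose(X.T @ w_perp, 0))           # True: w_perp lies in ker(X')

# Direction 2: a vector in ker(X') is perpendicular to X(v) for every v.
w_ker = np.linalg.svd(X.T)[2][-1]             # a unit vector annihilated by X'
print(np.allclose(X.T @ w_ker, 0))            # True by construction
v = rng.normal(size=p)
print(np.isclose(w_ker @ (X @ v), 0))         # True: g_W(w_ker, X(v)) = 0
```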
We're actually done now. The analysis has shown that $\mathbb W$ decomposes as a direct sum $\mathbb W = X(\mathbb V) \oplus X(\mathbb V)^\perp$. That is, we can take any arbitrary $y \in \mathbb W$ and write it uniquely as $y = y_0 + y^\perp$ with $y_0\in X(\mathbb V)$ and $y^\perp \in X(\mathbb V)^\perp$. That means $y_0$ is of the form $X(\beta)$ for at least one $\beta\in\mathbb V$. Notice, then, that
$$y - X\beta = (y_0 + y^\perp) - y_0 = y^\perp \in X(\mathbb V)^\perp.$$
The fundamental relationship says this is the same as the left-hand side being in the kernel of $X'$:
$$X'(y - X\beta) = 0,$$
whence $\beta$ solves the normal equations $X'X\beta = X'y$.
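Here is the decomposition argument rendered numerically for the usual inner product, again with a rank-deficient design so that $X'X$ is singular (a sketch with illustrative names): split $y$ into its projection onto the column space and the perpendicular remainder, pick any $\beta$ mapping onto the projection, and confirm that it solves the normal equations.

```python
# Sketch of the decomposition argument with a rank-deficient X (illustrative names):
# y = y0 + y_perp with y0 in im(X); then any beta with X beta = y0 solves X'X beta = X'y.
import numpy as np

rng = np.random.default_rng(3)
n = 8
X = rng.normal(size=(n, 3))
X = np.column_stack([X, X[:, 0] - X[:, 1]])    # rank 3 < 4 columns, so X'X is singular
y = rng.normal(size=n)

X_pinv = np.linalg.pinv(X)
y0 = X @ (X_pinv @ y)                          # projection of y onto im(X)
y_perp = y - y0                                # the perpendicular part
print(np.allclose(X.T @ y_perp, 0))            # True: y_perp lies in ker(X') = im(X)^perp

beta = X_pinv @ y0                             # one beta with X beta = y0 (many exist here)
print(np.allclose(X @ beta, y0))               # True
print(np.allclose(X.T @ X @ beta, X.T @ y))    # True: beta solves the normal equations
```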
We are now in a position to give a brief geometric answer to the question (along with some revealing comments): the normal equations have a solution because any $n$-vector $y\in\mathbb W$ decomposes (uniquely) as the sum of a vector $y_0$ in the range of $X$ and another vector $y^\perp$ perpendicular to $y_0$, and $y_0$ is the image of at least one $p$-vector $\beta\in\mathbb V$. The dimension of the image $X(\mathbb V)$ (its rank) is the dimension of the identifiable parameters. The dimension of the kernel of $X$ counts the nontrivial linear relations among the parameters. All parameters are identifiable when $X$ is a one-to-one map from $\mathbb V$ to its image in $\mathbb W$.
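To make the identifiability remarks concrete (an illustrative sketch, not part of the original answer): with a rank-deficient design, solutions of the normal equations differ by elements of the kernel of $X$, yet they all produce the same fitted values, so only the rank-many identifiable directions are determined by the data.

```python
# Sketch: with a rank-deficient X the solution of the normal equations is not unique;
# solutions differ by elements of ker(X), whose dimension counts the unidentifiable directions.
import numpy as np

rng = np.random.default_rng(4)
n = 8
X = rng.normal(size=(n, 3))
X = np.column_stack([X, X[:, 0] - X[:, 1]])    # columns satisfy one linear relation
y = rng.normal(size=n)

rank = np.linalg.matrix_rank(X)
print(rank, X.shape[1] - rank)                 # 3 identifiable dimensions, 1-dimensional kernel

beta1 = np.linalg.pinv(X) @ y                  # one solution of the normal equations
kernel_dir = np.array([1.0, -1.0, 0.0, -1.0])  # X @ kernel_dir = 0 by construction
print(np.allclose(X @ kernel_dir, 0))          # True
beta2 = beta1 + 2.5 * kernel_dir               # a different parameter vector
print(np.allclose(X.T @ X @ beta2, X.T @ y))   # True: it also solves the normal equations
print(np.allclose(X @ beta1, X @ beta2))       # True: identical fitted values y0
```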
It is ultimately useful to dispense with the space $\mathbb V$ altogether and work entirely with the subspace $\mathbb U = X(\mathbb V)\subset\mathbb W$, the "column space" of the matrix $X$. The normal equations amount to orthogonal projection onto $\mathbb U$. That frees us conceptually from being tied to any particular parameterization of the model and shows that least-squares models have an intrinsic dimension independent of how they happen to be parameterized.
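That parameterization-independence is easy to see numerically (a sketch with assumed names): two design matrices with the same column space determine the same orthogonal projector, hence identical fitted values, even though their coefficient vectors differ.

```python
# Sketch: two parameterizations of the same column space give the same projection of y.
import numpy as np

rng = np.random.default_rng(5)
n = 8
X1 = rng.normal(size=(n, 3))
A = rng.normal(size=(3, 3))                    # an invertible reparameterization (almost surely)
X2 = X1 @ A                                    # same column space, different parameterization

y = rng.normal(size=n)
P1 = X1 @ np.linalg.pinv(X1)                   # orthogonal projector onto the column space
P2 = X2 @ np.linalg.pinv(X2)
print(np.allclose(P1, P2))                     # True: the projector depends only on the subspace
print(np.allclose(P1 @ y, P2 @ y))             # True: identical fitted values
```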
One interesting outcome of this abstract algebraic demonstration is that we can solve the normal equations in arbitrary vector spaces. The result holds, say, for complex spaces, for spaces over finite fields (where minimizing a sum of squares makes little sense), and even over infinite-dimensional spaces that support suitable sesquilinear forms.
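For instance, over the complex numbers the conjugate transpose $X^*$ plays the role of $X'$ relative to the sesquilinear form $\langle u, v\rangle = u^* v$, and the same consistency holds; a quick illustrative check:

```python
# Sketch: the same consistency holds over the complex numbers, with the conjugate
# transpose X* playing the role of X' (sesquilinear inner product <u, v> = u* v).
import numpy as np

rng = np.random.default_rng(6)
n, p = 6, 3
X = rng.normal(size=(n, p)) + 1j * rng.normal(size=(n, p))
X = np.column_stack([X, X[:, 0]])              # duplicate a column: X* X is singular
y = rng.normal(size=n) + 1j * rng.normal(size=n)

XhX, Xhy = X.conj().T @ X, X.conj().T @ y
beta = np.linalg.pinv(XhX) @ Xhy
print(np.allclose(XhX @ beta, Xhy))            # True: the normal equations are consistent
```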