If I am doing standard OLS and want to calculate beta values (OLS estimators), which of the following is the more numerically stable method? And why?

Assuming that the columns of $X$ are already mean-centered and normalised, to solve $Y = X\beta + \epsilon$ do:

1) $\hat{\beta}_{\text{pinv}} = (X'X)^{+} X'Y$

2) $\hat{\beta}_{QR} = \text{solve}(R, Q'Y)$

where $^{+}$ denotes the Moore-Penrose pseudo-inverse, $Q$ and $R$ come from the QR decomposition of $X$, and `solve` is a function like the solve routines in Python or R.

I would have thought (2) was better, since $(X'X)^{+}$ seems to have a higher condition number than $R$, but in practice (in Python at least) I am finding that the beta values derived from (1) minimise the sum of squared residuals better.
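For concreteness, here is a minimal sketch (numpy, simulated data; the problem size, seed and variable names are arbitrary) of the kind of comparison I am running:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # mean-centre and normalise columns
beta_true = rng.standard_normal(p)
Y = X @ beta_true + rng.standard_normal(n)

# (1) pseudo-inverse of X'X
beta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ Y

# (2) QR: solve the triangular system R beta = Q'Y
Q, R = np.linalg.qr(X)                      # reduced ("thin") QR by default
beta_qr = np.linalg.solve(R, Q.T @ Y)

for name, b in [("pinv", beta_pinv), ("QR", beta_qr)]:
    print(name, np.sum((Y - X @ b) ** 2))   # sum of squared residuals
```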


#### Best Answer

Using the Moore-Penrose pseudo-inverse $X^{\dagger}$ of a matrix $X$ is more stable in the sense that it can directly account for rank-deficient design matrices $X$. $X^{\dagger}$ satisfies the defining identities $X X^{\dagger} X = X$ and $X^{\dagger} X X^{\dagger} = X^{\dagger}$, so it can be used as a "surrogate" for the true inverse of $X$ even when the inverse $X^{-1}$ does not exist. In addition, the "usual" way of computing $X^{\dagger}$, via the Singular Value Decomposition $X = USV^T$, is methodologically straightforward and computationally well-studied: we simply take the reciprocal of the non-zero singular values in the diagonal matrix $S$, and we are good to go. Moore-Penrose pseudo-inverses are common in many proofs because they "just exist" and *greatly simplify many derivations*.
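A minimal numpy sketch of that SVD recipe (the relative tolerance used to decide which singular values count as "non-zero" is an assumption here, mirroring what `np.linalg.pinv` does internally):

```python
import numpy as np

def pinv_via_svd(X, rtol=1e-12):
    # Thin SVD: X = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Reciprocate only the "non-zero" singular values; the cutoff is a modelling choice
    s_inv = np.where(s > rtol * s.max(), 1.0 / s, 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T

# Sanity check against numpy's built-in pseudo-inverse
X = np.random.default_rng(1).standard_normal((50, 5))
print(np.allclose(pinv_via_svd(X), np.linalg.pinv(X)))
```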

That said, in most cases it is not good practice to use the Moore-Penrose pseudo-inverse unless we have a very good reason (e.g. our procedure consistently involves small and potentially rank-degenerate covariance matrices). The reasons are that: 1. it can hide true underlying problems with our data (e.g. duplication of variables), and 2. it is unnecessarily expensive (we have better alternatives). Finally, note that the Moore-Penrose pseudo-inverse of a full-rank $X$ can be computed directly from the QR factorisation $X = QR$ as $X^{\dagger} = [R_1^{-1} \; 0]\, Q^T$, where $R_1$ is the upper triangular matrix coming from the "thin/reduced/skinny" QR factorisation of $X$; so we do not really gain much if $X$ is full rank anyway. (Gentle's *Matrix Algebra: Theory, Computations and Applications in Statistics* provides a wealth of information on the matter if one wishes to explore this further – Sect. 3.6 on *Generalised Inverses* is a relevant starting point.)
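A sketch of that equivalence for a full-column-rank $X$, using numpy's thin QR (in practice one would use a dedicated triangular solver such as `scipy.linalg.solve_triangular` rather than a general solve, but the identity is the same):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 6))       # assumed to have full column rank

Q, R1 = np.linalg.qr(X)                 # thin/reduced QR: X = Q R1
# For full-rank X: X^dagger = R1^{-1} Q^T, i.e. solve the triangular system R1 Z = Q^T
X_dagger_qr = np.linalg.solve(R1, Q.T)

print(np.allclose(X_dagger_qr, np.linalg.pinv(X)))
```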

To elaborate on my first point a bit: if we have issues with collinearity, or simply have $p \gg n$ (i.e. more predictors than data points), it is far more natural to use a penalised regression procedure like Ridge or LASSO than to hide the problem using $X^{\dagger}$. The condition of a system of equations solved through Moore-Penrose pseudo-inverses might still be prohibitively bad, resulting in unstable solutions and/or misleading inference. Thus, if numerical stability is an issue, I would suggest using regularisation directly instead of Moore-Penrose pseudo-inverses. Note that in terms of speed, computing $X^{\dagger}$ is also problematic; iterative methods based on gradient descent or alternating least squares are potentially far faster for large systems (e.g. in the Recommender Systems literature, see Paterek (2008), *Improving regularized singular value decomposition for collaborative filtering*, for something very concise).
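To illustrate the regularisation route, a minimal Ridge sketch (the penalty $\lambda$ below is arbitrary and would normally be chosen by cross-validation; the data is simulated):

```python
import numpy as np

def ridge_beta(X, Y, lam=1.0):
    p = X.shape[1]
    # X'X + lam * I is well-conditioned even when X'X is (near-)singular
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(3)
n, p = 30, 100                          # p >> n: more predictors than data points
X = rng.standard_normal((n, p))
Y = X[:, :5] @ np.ones(5) + rng.standard_normal(n)
print(ridge_beta(X, Y, lam=10.0)[:5])   # coefficients on the truly active predictors
```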
