I have some samples of data of the form $x, y$ and $z=f(x,y)$. I wish to fit a plane $$z = Ax + By + C$$ to the data with the smallest mean squared error.
I have found an "answer" in section 3 of this document, but it stops at a system of equations that still have to be solved. I just about have the ability to solve these equations, but the process gets so messy that the likelihood of me making a mistake is quite high. Surely someone has written out the full solution longhand somewhere (it might be called "closed form"), in the form $$A=\ldots,$$ $$B=\ldots,$$ $$C=\ldots.$$
EDIT: Maybe "closed form" is the wrong phrase. So let me be clear. I want an explicit solution for $A$, $B$, and $C$, and not a solution that ends with "So now if you can solve these equations then you can discover the values of $A$, $B$ and $C$".
Best Answer
Elsewhere on this site, explicit solutions to the ordinary least squares regression
$$\mathbb{E}(z_i) = A x_i + B y_i + C$$
are available in matrix form as
$$(C,A,B)^\prime = (X^\prime X)^{-1} X^\prime z\tag{1}$$
where $X$ is the "model matrix"
$$X = \pmatrix{1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n}$$
and $z$ is the response vector
$$z = (z_1, z_2, \ldots, z_n)^\prime.$$
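For readers who want to compute this directly, $(1)$ transcribes into a few lines of R. (This is a minimal sketch; it assumes the data sit in a data frame xyz with columns x, y, and z, as in the code at the end of this answer.)
# Build the model matrix with an intercept column, then apply formula (1).
X <- cbind(1, xyz$x, xyz$y)                   # model matrix
z <- xyz$z                                    # response vector
beta <- solve(crossprod(X), crossprod(X, z))  # (X'X)^{-1} X'z = (C, A, B)'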
That's a perfectly fine, explicit, computable answer. But maybe there is some additional understanding that can be wrung out of it by inspecting the coefficients. This can be achieved by choosing appropriate units in which to express the variables.
The best units for this purpose center each variable at its mean and use its standard deviation as the unit of measurement. Explicitly, let the three means be $m_x, m_y,$ and $m_z$ and the three standard deviations be $s_x, s_y,$ and $s_z$. (It turns out not to matter whether you divide by $n$ or $n-1$ in computing the standard deviations. Just make sure you use a consistent convention when you compute any second moment of the data.) The values of the variables in these new units of measurement are
$$\xi_i = \frac{x_i - m_x}{s_x}, \quad \eta_i = \frac{y_i - m_y}{s_y}, \quad \zeta_i = \frac{z_i - m_z}{s_z}.$$
This process is known as standardizing the data. The variables $\xi$, $\eta$, and $\zeta$ are the standardized versions of the original variables $x$, $y$, and $z$.
These relationships are invertible:
$$x_i = s_x \xi_i + m_x, \quad y_i = s_y \eta_i + m_y, \quad z_i = s_z \zeta_i + m_z.$$
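In R, this standardization is a single call to the built-in scale function. A sketch, under the same assumption of a data frame xyz (scale uses the $n-1$ convention, which is fine so long as it is applied consistently):
# Center each column at its mean and divide by its standard deviation.
std <- scale(xyz)
xi <- std[, "x"]; eta <- std[, "y"]; zeta <- std[, "z"]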
Plugging these into the defining relationship
$$\mathbb{E}(z_i) = C + Ax_i + By_i$$
and simplifying yields
$$\mathbb{E}(s_z \zeta_i + m_z) = C + A(s_x \xi_i + m_x) + B(s_y \eta_i + m_y).$$
Solving for the expectation of the standardized dependent variable $\zeta_i$ (subtract $m_z$ from both sides and divide through by $s_z$) yields
$$\mathbb{E}(\zeta_i) = \left(\frac{C + Am_x + Bm_y - m_z}{s_z}\right) + \left(\frac{A s_x}{s_z}\right) \xi_i + \left(\frac{B s_y}{s_z}\right) \eta_i.$$
If we write these coefficients as $\beta_0, \beta_1, \beta_2$ respectively, then we can recover $A, B, C$ by comparing and solving. For the record this gives
$$A = \frac{s_z \beta_1}{s_x}, \quad B = \frac{s_z \beta_2}{s_y}, \quad\text{and}\quad C = s_z \beta_0 + m_z - A m_x - B m_y.\tag{2}$$
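In code, recovering $A$, $B$, and $C$ amounts to fitting the standardized regression and then applying $(2)$. A sketch, continuing from the standardized variables xi, eta, and zeta introduced above:
# Fit the regression in standardized units ...
b <- coef(lm(zeta ~ xi + eta))               # (beta0, beta1, beta2)
m <- colMeans(xyz); s <- apply(xyz, 2, sd)
# ... and undo the change of units via formula (2).
A <- s["z"] * b["xi"] / s["x"]
B <- s["z"] * b["eta"] / s["y"]
C <- s["z"] * b["(Intercept)"] + m["z"] - A * m["x"] - B * m["y"]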
The point of this becomes apparent when we consider the new model matrix
$$\Xi = \pmatrix{1 & \xi_1 & \eta_1 \\ 1 & \xi_2 & \eta_2 \\ \vdots & \vdots & \vdots \\ 1 & \xi_n & \eta_n}$$
and the new response vector $\zeta = (\zeta_1, \zeta_2, \ldots, \zeta_n)^\prime$, because now
$$\Xi^\prime \Xi = \pmatrix{n & 0 & 0 \\ 0 & n & n\rho \\ 0 & n\rho & n}$$
and
$$\Xi^\prime \zeta = (0, n\tau, n\upsilon)^\prime$$
where $\rho = \frac{1}{n}\sum_{i=1}^n \xi_i \eta_i$ is the correlation coefficient between $x$ and $y$, $\tau = \frac{1}{n}\sum_{i=1}^n \xi_i \zeta_i$ is the correlation between $x$ and $z$, and $\upsilon = \frac{1}{n}\sum_{i=1}^n \eta_i \zeta_i$ is the correlation between $y$ and $z$.
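This structure is easy to verify numerically. A sketch, again assuming the data frame xyz, and rescaling scale's output so the second moments follow the $1/n$ convention (which makes the entries come out exactly as displayed):
n <- nrow(xyz)
std <- scale(xyz) * sqrt(n / (n - 1))   # switch to the 1/n convention
Xi <- cbind(1, std[, "x"], std[, "y"])
crossprod(Xi)                # n on the diagonal, n*rho off it
crossprod(Xi, std[, "z"])    # (0, n*tau, n*upsilon)'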
To solve the normal equations $\Xi^\prime \Xi (\beta_0, \beta_1, \beta_2)^\prime = \Xi^\prime \zeta$ (the system behind $(1)$), we may divide both sides by $n$, giving
$$\pmatrix{1 & 0 & 0 \\ 0 & 1 & \rho \\ 0 & \rho & 1}\pmatrix{\beta_0 \\ \beta_1 \\ \beta_2} = \pmatrix{0 \\ \tau \\ \upsilon}.$$
What originally looked like a formidable matrix formula has been reduced to a truly elementary set of three simultaneous equations. Provided $|\rho| \lt 1$, its solution is easily found to be
$$\pmatrix{\hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2} = \frac{1}{1-\rho^2}\pmatrix{0 \\ \tau-\rho\upsilon \\ \upsilon-\rho\tau}.$$
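Equivalently, a direct solve of the reduced $3\times 3$ system reproduces this closed form. A sketch, where rho, tau, and ups stand for the three correlations, however they were computed:
# Solve the reduced normal equations directly.
M <- rbind(c(1, 0,   0),
           c(0, 1,   rho),
           c(0, rho, 1))
solve(M, c(0, tau, ups))  # equals c(0, tau - rho*ups, ups - rho*tau) / (1 - rho^2)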
Plugging these into the coefficients in $(2)$ produces the estimates $\hat A, \hat B,$ and $\hat C$.
In fact, even more has been achieved:
- It is now evident why the cases $|\rho|=1$ are problematic: they introduce a division by zero in the solution.
- It is equally evident how to determine whether a solution exists when $|\rho|=1$, and how to obtain it: a solution exists when the second and third normal equations in $\Xi$ are redundant, and it is obtained simply by dropping one of the two variables $x$ and $y$ from the model.
- We can derive some general insight into the solution. For instance, because $\hat\beta_0=0$ in all cases, we may conclude that the fitted plane must pass through the point of averages $(m_x, m_y, m_z)$.
- It is now evident that the solution can be found in terms of the first two moments of the trivariate dataset $(x, y, z)$. This sheds further light on the fact that coefficient estimates can be found from means and covariance matrices alone. Furthermore, equation $(2)$ shows that the means are needed only to estimate the intercept term $C$; estimates of the two slopes $A$ and $B$ require only the second moments.
- When the regressors are uncorrelated, $\rho=0$ and the solution is especially simple: in standardized units the intercept is zero and the slopes are just the correlation coefficients $\tau$ and $\upsilon$ between the response $z$ and the regressors $x$ and $y$. This is easy to remember and shows how regression coefficients are related to correlation coefficients, as the sketch below illustrates.
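Here is a quick check of that last point with simulated data (a sketch: the regressors are made exactly uncorrelated in the sample, everything is standardized, and the fitted slopes then coincide with the correlations):
set.seed(1)
n <- 50
x <- rnorm(n)
y <- resid(lm(rnorm(n) ~ x))      # force the sample correlation with x to zero
x <- scale(x); y <- scale(y)      # standardize the regressors
z <- scale(2 * x - y + rnorm(n))  # standardized response
coef(lm(z ~ x + y))[-1]           # the two slopes ...
c(cor(x, z), cor(y, z))           # ... equal these two correlations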
Putting this all together, we find that (except in the degenerate cases $|rho|=1$) the estimates can be written
$$\eqalign{ \hat A &= \frac{\tau - \rho\upsilon}{1-\rho^2}\,\frac{s_z}{s_x} \\ \hat B &= \frac{\upsilon - \rho\tau}{1-\rho^2}\,\frac{s_z}{s_y} \\ \hat C &= m_z - m_x \hat A - m_y \hat B. }$$
In these formulae, the $m_{*}$ are the sample means, the $s_{*}$ are the sample standard deviations, and the Greek letters $\rho, \tau,$ and $\upsilon$ represent the three correlation coefficients (between $x$ and $y$, $x$ and $z$, and $y$ and $z$, respectively).
Please note that these formulas are not the best way to carry out the calculations. They all involve subtracting quantities that might be of comparable size, such as $\tau-\rho\upsilon$, $\upsilon-\rho\tau$, and $m_z - (m_x \hat A + m_y \hat B)$. Such subtraction can cause loss of precision. The matrix formulation allows numerical analysts to obtain more stable solutions that preserve as much precision as possible. This is one reason people rarely have any interest in term-by-term formulas; the other is that as the number of regressors increases, the formulas rapidly grow in complexity and soon become unwieldy.
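R's lm, for example, works from a QR decomposition of the model matrix rather than from moment formulas, which avoids forming $X^\prime X$ altogether. A sketch, again assuming the xyz data frame built in the code below:
X <- cbind(1, xyz$x, xyz$y)
qr.coef(qr(X), xyz$z)   # least squares via QR, without ever forming X'X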
As further evidence of the correctness of these formulas, we may compare their answers to those of a standard least-squares solver, the lm function in R.
#
# Generate trivariate data.
#
library(MASS)
set.seed(17)
n <- 20
mu <- 1:3
Sigma <- matrix(1, 3, 3)
Sigma[lower.tri(Sigma)] <- Sigma[upper.tri(Sigma)] <- c(.8, .5, .6)
xyz <- data.frame(mvrnorm(n, mu, Sigma))
names(xyz) <- c("x", "y", "z")
#
# Obtain the least squares coefficients.
#
beta.hat <- coef(lm(z ~ x + y, xyz))
#
# Compute the first two moments via `colMeans` and `cov`.
#
m <- colMeans(xyz)
sigma <- cov(xyz)
s <- sqrt(diag(sigma))
rho <- t(t(sigma/s)/s); rho <- as.vector(rho[lower.tri(rho)])
#
# Here are the least squares coefficient estimates in terms of the moments.
#
A.hat <- (rho[2] - rho[1]*rho[3]) / (1 - rho[1]^2) * s[3] / s[1]
B.hat <- (rho[3] - rho[1]*rho[2]) / (1 - rho[1]^2) * s[3] / s[2]
C.hat <- m[3] - m[1]*A.hat - m[2]*B.hat
#
# Compare the two solutions.
#
rbind(beta.hat, formulae=c(C.hat, A.hat, B.hat))
The output exhibits two identical rows of estimates, as expected:
         (Intercept)         x        y
beta.hat    1.522571 0.3013662 0.403636
formulae    1.522571 0.3013662 0.403636