I was reading the Wikipedia article on kriging. I didn't understand the part where it says that
Kriging computes the best linear unbiased estimator, $\hat Z(x_0)$, of $Z(x_0)$ such that the kriging variance is minimized under the unbiasedness condition. I don't follow the derivation, nor how the variance is minimized. Any suggestions?
Specifically, I didn't get the part where the minimization is applied subject to the unbiasedness condition.
I think it should have been
$E[Z'(x_0)-Z(x_0)]$ instead of $E[Z'(x)-Z(x)]$, shouldn't it? ($'$ is equivalent to the hat in the wiki article.) Also, I didn't get how the kriging error is derived.
Best Answer
Suppose $\left(Z_0, Z_1, \ldots, Z_n\right)$ is a vector assumed to have a multivariate distribution of unknown mean $(\mu, \mu, \ldots, \mu)$ and known variance-covariance matrix $\Sigma$. We observe $\left(z_1, z_2, \ldots, z_n\right)$ from this distribution and wish to predict $z_0$ from this information using an unbiased linear predictor:
- Linear means the prediction must take the form $\hat{z}_0 = \lambda_1 z_1 + \lambda_2 z_2 + \cdots + \lambda_n z_n$ for coefficients $\lambda_i$ to be determined. These coefficients can depend at most on what is known in advance: namely, the entries of $\Sigma$.
This predictor can also be considered a random variable $\hat{Z}_0 = \lambda_1 Z_1 + \lambda_2 Z_2 + \cdots + \lambda_n Z_n$.
- Unbiased means the expectation of $\hat{Z}_0$ equals the (unknown) mean $\mu$ of $Z_0$.
Writing things out gives some information about the coefficients:
$$\eqalign{ \mu &= E[\hat{Z}_0] = E[\lambda_1 Z_1 + \lambda_2 Z_2 + \cdots + \lambda_n Z_n] \\ &= \lambda_1 E[Z_1] + \lambda_2 E[Z_2] + \cdots + \lambda_n E[Z_n] \\ &= \lambda_1 \mu + \cdots + \lambda_n \mu \\ &= \left(\lambda_1 + \cdots + \lambda_n\right) \mu. }$$
The second line is due to linearity of expectation and all the rest is simple algebra. Because this procedure is supposed to work regardless of the value of $\mu$, evidently the coefficients have to sum to unity. Writing the coefficients in vector notation $\lambda = (\lambda_i)'$, this can be neatly written $\mathbf{1}\lambda=1$.
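As a quick numerical check (the numbers here are purely illustrative, not from the original answer): with $n = 2$ and $(\lambda_1, \lambda_2) = (0.3, 0.7)$, we get $E[0.3\,Z_1 + 0.7\,Z_2] = (0.3 + 0.7)\mu = \mu$ no matter what $\mu$ is, so the predictor is unbiased; coefficients $(0.5, 0.3)$ would instead give expectation $0.8\,\mu$, a biased predictor unless $\mu = 0$.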
Among the set of all such unbiased linear predictors, we seek one that deviates as little from the true value as possible, measured by the root mean square error. This, again, is a computation. Because the predictor is unbiased, the expectation of $\hat{Z}_0 - Z_0$ is zero, so this mean squared error equals the variance of the difference; that variance expands using the bilinearity and symmetry of covariance, whose application is responsible for the summations in the second line:
$$\eqalign{ E[(\hat{Z}_0 - Z_0)^2] &= E[(\lambda_1 Z_1 + \lambda_2 Z_2 + \cdots + \lambda_n Z_n - Z_0)^2] \\ &= \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j \operatorname{cov}[Z_i, Z_j] - 2\sum_{i=1}^n\lambda_i \operatorname{cov}[Z_i, Z_0] + \operatorname{var}[Z_0] \\ &= \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j \Sigma_{i,j} - 2\sum_{i=1}^n\lambda_i\Sigma_{0,i} + \Sigma_{0,0}. }$$
Whence the coefficients can be obtained by minimizing this quadratic form subject to the (linear) constraint $\mathbf{1}\lambda=1$. This is readily solved using the method of Lagrange multipliers, yielding a linear system of equations, the "Kriging equations."
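Here is a sketch of that step. The multiplier $m$ and the factor of $2$ in front of it are notational choices made here (other texts use the opposite sign, which flips the sign of $m$ in the results). Adjoin the constraint to the quadratic form:
$$\mathcal{L}(\lambda, m) = \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j \Sigma_{i,j} - 2\sum_{i=1}^n\lambda_i\Sigma_{0,i} + \Sigma_{0,0} + 2m\left(\sum_{i=1}^n \lambda_i - 1\right).$$
Setting the partial derivatives with respect to each $\lambda_i$ and to $m$ equal to zero gives $n+1$ linear equations in the $n+1$ unknowns $\lambda_1, \ldots, \lambda_n, m$:
$$\eqalign{ \sum_{j=1}^n \Sigma_{i,j}\,\lambda_j + m &= \Sigma_{0,i}, \qquad i = 1, \ldots, n, \\ \sum_{j=1}^n \lambda_j &= 1. }$$
Substituting the first set of equations back into the quadratic form shows that its minimized value is $\Sigma_{0,0} - \sum_{i=1}^n \lambda_i\Sigma_{0,i} - m$ (with this sign convention for $m$).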
In the application, $Z$ is a spatial stochastic process ("random field"). This means that for any given set of fixed (not random) locations $\mathbf{x_0}, \ldots, \mathbf{x_n}$, the vector of values of $Z$ at those locations, $\left(Z(\mathbf{x_0}), \ldots, Z(\mathbf{x_n})\right)$, is random with some kind of a multivariate distribution. Write $Z_i = Z(\mathbf{x_i})$ and apply the foregoing analysis, assuming the means of the process at all $n+1$ locations $\mathbf{x_i}$ are the same and assuming the covariance matrix of the process values at these $n+1$ locations is known with certainty.
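In practice (this example is mine, not part of the original answer) the entries of $\Sigma$ are usually supplied by a covariance function of the separation between locations, for instance the exponential model
$$\Sigma_{i,j} = C\left(\|\mathbf{x_i} - \mathbf{x_j}\|\right), \qquad C(h) = \sigma^2 e^{-h/\rho},$$
with sill $\sigma^2$ and range parameter $\rho$. This is the role played by the variogram/covariance model mentioned at the end of this answer.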
Let's interpret this. Under the assumptions (including constant mean and known covariance), the coefficients determine the minimum variance attainable by any unbiased linear predictor. Let's call this variance $\sigma_{OK}^2$ ("OK" is for "ordinary kriging"). It depends solely on the matrix $\Sigma$. It tells us that if we were to repeatedly sample from $\left(Z_0, \ldots, Z_n\right)$ and use these coefficients to predict the $z_0$ values from the remaining values each time, then
- On average, our predictions would be correct.
- Typically, our predictions of the $z_0$ would deviate by about $\sigma_{OK}$ from the actual values of the $z_0$.
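To make these two points concrete, here is a minimal numerical sketch (the locations, the exponential covariance model, and the data values are all invented purely for illustration, and it assumes NumPy). It builds $\Sigma$ from an assumed covariance function, solves the kriging system sketched above, and reports the weights and $\sigma_{OK}^2$:

```python
import numpy as np

def exp_cov(h, sill=1.0, range_=2.0):
    """Assumed exponential covariance C(h) = sill * exp(-|h| / range_)."""
    return sill * np.exp(-np.abs(h) / range_)

# Hypothetical 1-D sample locations x_1..x_n and prediction location x_0.
x = np.array([0.0, 1.0, 3.0, 4.5])   # x_1, ..., x_n
x0 = 2.0                              # x_0

# Covariances among the samples and between the samples and the target.
Sigma11 = exp_cov(x[:, None] - x[None, :])   # entries Sigma_{i,j}, i,j >= 1
sigma0 = exp_cov(x - x0)                     # entries Sigma_{0,i}
Sigma00 = exp_cov(0.0)                       # Sigma_{0,0}

# Ordinary kriging system:
#   sum_j Sigma_{i,j} lambda_j + m = Sigma_{0,i}   (i = 1..n)
#   sum_j lambda_j = 1
n = len(x)
A = np.zeros((n + 1, n + 1))
A[:n, :n] = Sigma11
A[:n, n] = 1.0
A[n, :n] = 1.0
b = np.append(sigma0, 1.0)
sol = np.linalg.solve(A, b)
lam, m = sol[:n], sol[n]

print("weights:", lam)                   # the lambda_i
print("sum of weights:", lam.sum())      # equals 1: the unbiasedness condition
sigma2_OK = Sigma00 - lam @ sigma0 - m   # minimized (kriging) variance
print("kriging variance:", sigma2_OK)

# Given observed values z_1..z_n (again hypothetical), the prediction
# is simply the weighted sum of the observations.
z = np.array([1.2, 0.7, 1.5, 1.1])
print("prediction at x0:", lam @ z)
```

The printed sum of weights equals one by construction, and $\sigma_{OK}$ (the square root of the reported variance) is the typical size of the prediction error under the assumed covariance model.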
Much more needs to be said before this can be applied to practical situations like estimating a surface from punctual data: we need additional assumptions about how the statistical characteristics of the spatial process vary from one location to another and from one realization to another (even though, in practice, usually only one realization will ever be available). But this exposition should be enough to follow how the search for a "Best" Unbiased Linear Predictor ("BLUP") leads straightforwardly to a system of linear equations.
By the way, kriging as usually practiced is not quite the same as least squares estimation, because $\Sigma$ is estimated in a preliminary procedure (known as "variography") using the same data. That is contrary to the assumptions of this derivation, which assumed $\Sigma$ was known (and a fortiori independent of the data). Thus, at the very outset, kriging has some conceptual and statistical flaws built into it. Thoughtful practitioners have always been aware of this and found various creative ways to (try to) justify the inconsistencies. (Having lots of data can really help.) Procedures now exist for simultaneously estimating $\Sigma$ and predicting a collection of values at unknown locations. They require slightly stronger assumptions (multivariate normality) in order to accomplish this feat.