Why can "missing data" and "outliers" affect the performance of least squares estimation?

#### Best Answer

I'm not sure about the "missing data" part, but I can give an answer on "outliers".

This is basically due to the "unbounded" influence that a single observation can have in least squares (or at least in conventional least squares). A very, very simple example of least squares should show this. Suppose you only estimate an intercept $\mu$ using data $Y_i\ (i=1,\dots,n)$. The least squares objective is

$$\sum_{i=1}^{n} (Y_i-\mu)^2$$

This is minimised by choosing $\hat{\mu}=n^{-1}\sum_{i=1}^{n} Y_i=\overline{Y}$. Now suppose I add one extra observation to the sample, equal to $X$. How will the estimate change? The new best estimate using $n+1$ observations is just $$\hat{\mu}_{+1}=\frac{X+\sum_{i=1}^{n} Y_i}{n+1}$$

Rearranging terms gives

$$\hat{\mu}_{+1}=\hat{\mu}+\frac{1}{n+1}(X-\hat{\mu})$$

Now for a given sample, $\hat{\mu}$ and $n$ are fixed, while the shift $(X-\hat{\mu})/(n+1)$ grows without bound in $X$. So I can essentially "choose" $X$ to get any new average that I want!
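To make this concrete, here is a minimal sketch in plain Python. The sample values and the helper names `updated_mean` and `x_for_target` are made up for illustration; the second helper just solves the update formula above for $X$.

```python
def updated_mean(mean, n, x):
    """Mean of n + 1 points, given the mean of the first n and the new point x."""
    return mean + (x - mean) / (n + 1)


def x_for_target(mean, n, target):
    """Solve the update formula for the x that moves the mean to `target`."""
    return mean + (n + 1) * (target - mean)


# Hypothetical sample, chosen only for illustration.
y = [2.0, 3.0, 5.0, 6.0]
n = len(y)
mu_hat = sum(y) / n

# Pick any target; a single extra observation is enough to hit it.
target = 100.0
x = x_for_target(mu_hat, n, target)

# Recompute the mean from scratch to check the shortcut formula.
new_mean = (sum(y) + x) / (n + 1)
print(x, new_mean, updated_mean(mu_hat, n, x))
```

The further the target is from the current mean, the more extreme the single observation has to be, but nothing in least squares stops it.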

Using the same argument, you can show that deleting the $j$th observation has a similar effect:

$$\hat{\mu}_{-j}=\hat{\mu}+\frac{-1}{n-1}(Y_{j}-\hat{\mu})$$

And similarly (a bit tediously), you can show that removing $M$ observations gives:

$$\hat{\mu}_{-M}=\hat{\mu}+\frac{-M}{n-M}(\overline{Y}_{M}-\hat{\mu})$$

where $\overline{Y}_{M}$ is the average of the observations that you removed.
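If you want to convince yourself of the deletion formulas without doing the algebra, a quick numerical check is enough. This is only a sketch: the data and the function name `mean_after_removal` are invented for the example, and the single-deletion case is just $M=1$.

```python
def mean_after_removal(mu_hat, n, removed):
    """Deletion formula: mean of the remaining n - M points."""
    m = len(removed)
    ybar_m = sum(removed) / m  # average of the removed observations
    return mu_hat - m / (n - m) * (ybar_m - mu_hat)


# Hypothetical sample with one large value at the end.
y = [1.0, 2.0, 4.0, 7.0, 11.0, 50.0]
n = len(y)
mu_hat = sum(y) / n

# Remove the last observation (M = 1), then the last two (M = 2).
for m in (1, 2):
    kept, removed = y[:-m], y[-m:]
    brute_force = sum(kept) / len(kept)
    via_formula = mean_after_removal(mu_hat, n, removed)
    print(m, brute_force, via_formula)
```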

The same kind of thing happens in general least squares: the estimate "chases" the outliers. If you are worried about this, then "least absolute deviations" may be a better way to go (but it can be less efficient if you don't have any outliers).
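In the intercept-only example the contrast is easy to see, because the least squares fit is the sample mean while the least absolute deviations fit is a sample median. A rough sketch using Python's `statistics` module follows; the data and the helper `fit_with_extra_point` are hypothetical.

```python
from statistics import mean, median

# Hypothetical clean sample.
y = [2.0, 3.0, 5.0, 6.0]


def fit_with_extra_point(y, x):
    """Return (least squares fit, least absolute deviations fit) after adding x."""
    sample = y + [x]
    return mean(sample), median(sample)


# Push the extra observation further and further out.
results = {x: fit_with_extra_point(y, x) for x in (10.0, 1e3, 1e6)}
for x, (ls_fit, lad_fit) in results.items():
    print(f"X = {x:>9}: mean = {ls_fit:.1f}, median = {lad_fit:.1f}")
```

The mean follows the extra point wherever it goes, while the median stops moving once the point is larger than everything else in the sample.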

Influence functions are a good way to study this stuff (outliers and robustness). For example, you can get an approximate change in the variance $s^2=n^{-1}\sum_{i=1}^{n}(Y_i-\overline{Y})^2$ when the $j$th observation is deleted as:

$$s^2_{-j} = s^2 +\frac{-1}{n-1}((Y_j-\overline{Y})^2-s^2) + O(n^{-2})$$
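You can also check the variance approximation numerically. The sketch below assumes the divide-by-count definition of $s^2$ used above (so the leave-one-out variance divides by $n-1$); the data and the names `var_n` and `var_without_approx` are invented for the example.

```python
def var_n(values):
    """Variance with divisor equal to the number of values (not count - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def var_without_approx(values, j):
    """First-order approximation to the variance after deleting values[j]."""
    n = len(values)
    ybar = sum(values) / n
    s2 = var_n(values)
    return s2 - ((values[j] - ybar) ** 2 - s2) / (n - 1)


# Hypothetical sample: a repeating pattern plus one unusual point.
y = [float(i % 7) for i in range(50)]
j = 10
y[j] = 12.0

n = len(y)
d = y[j] - sum(y) / n                # deviation of the deleted point
exact = var_n(y[:j] + y[j + 1:])     # recompute without observation j
approx = var_without_approx(y, j)
gap = exact - approx                 # the O(n^-2) remainder
print(exact, approx, gap)
```

With this definition of $s^2$ the gap works out to $-(Y_j-\overline{Y})^2/(n-1)^2$, so the approximation is good for typical points and gets worse exactly when $Y_j$ is a large outlier.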