This question seems fundamental enough that I'm convinced it has been answered here somewhere, but I haven't found it.

I understand that if the dependent variable in a regression is normally distributed, maximum likelihood and ordinary least squares produce the same parameter estimates.

When the dependent variable is *not* normally distributed, the OLS parameter estimates are no longer equivalent to the MLE, but (under the Gauss-Markov assumptions) they are still Best (minimum variance) Linear Unbiased Estimates (BLUE).

**So, what are the properties of MLE that make it desirable beyond what OLS has to offer (being BLUE)?**

**In other words, what do I lose if I can't say my OLS estimates are maximum likelihood estimates?**

To motivate this question a little: I'm wondering why I would want to choose a regression model other than OLS in the presence of a clearly non-normal dependent variable.


#### Best Answer

As you move sufficiently far away from normality, *all linear estimators may be arbitrarily bad*.

Knowing that you can get the best of a bad lot (i.e. the *best* linear unbiased estimate) isn't much consolation.

If you can specify a suitable distributional model (*ay, there's the rub*), maximizing the likelihood has both a direct intuitive appeal, in that it "maximizes the chance" of seeing the sample you actually saw (with a suitable refinement of what we mean by that for the continuous case), and a number of very neat properties that are both theoretically and practically useful (e.g. its relationship to the Cramér-Rao lower bound, equivariance under transformation, its relationship to likelihood ratio tests, and so forth). This motivates M-estimation, for example.

Even when you can't specify a model, it is possible to construct a model for which ML is robust to contamination by gross errors in the conditional distribution of the response: one that retains pretty good efficiency at the Gaussian but avoids the potentially disastrous impact of arbitrarily large outliers.

[That's not the only consideration with regression, since there's also a need for robustness to the effect of influential outliers, for example, but it's a good initial step.]
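To make the "construct a model" idea concrete, here is one standard instance, Huber's loss. It is named here purely as an illustration; the symbols $\rho_k$, $k$ and $r$ are notation introduced for this sketch, and the error scale is assumed known and equal to 1. An M-estimator of the regression coefficients minimizes a sum of losses on the residuals:

$$
\hat\beta = \arg\min_{\beta} \sum_{i=1}^{n} \rho_k\!\left(y_i - x_i^\top \beta\right),
\qquad
\rho_k(r) =
\begin{cases}
\tfrac12 r^2, & |r| \le k,\\[2pt]
k\,|r| - \tfrac12 k^2, & |r| > k.
\end{cases}
$$

This is exactly maximum likelihood under an error density $f(r) \propto \exp\{-\rho_k(r)\}$, which is Gaussian in the middle and has double-exponential tails. Because $\rho_k'(r)$ is bounded by $k$ in absolute value, a single wild residual can only pull on the fit so hard, while a common choice such as $k \approx 1.345$ keeps roughly 95% efficiency when the errors really are Gaussian.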

As a demonstration of the problem with even the best linear estimator, consider this comparison of slope estimators for regression. In this case there are 100 observations in each sample, x is 0/1, the true slope is $\frac12$ and errors are standard Cauchy. The simulation takes 1000 sets of simulated data and computes the least squares estimate of slope ("LS") as well as a couple of nonlinear estimators that could be used in this situation (neither is fully efficient at the Cauchy but they're both reasonable) – one is an L1 estimator of the line ("L1") and the second computes a simple L-estimate of location at the two values of x and fits a line joining them ("LE").

The top part of the diagram is a boxplot of those thousand slope estimates for each estimator. The lower part is the central one percent of that image (roughly; it is marked with a faint orange-grey box in the top plot) "blown up" so we can see more detail. As we see, the least squares slopes range from -771 to 1224, and the lower and upper quartiles are -1.24 and 2.46. The absolute error in the LS slope exceeded 10 in more than 10% of the samples. The two nonlinear estimators do much better. They perform fairly similarly to each other: none of the 1000 slope estimates in either case is more than 0.84 from the true slope, and the median absolute error in the slope is in the ballpark of 0.14 for each (vs 1.86 for the least squares estimator). The LS slope has an RMSE 223 and 232 times that of the L1 and LE estimators, respectively, in this case (that's not an especially meaningful quantity, however, as the LS estimator doesn't have a finite variance when you have Cauchy errors).
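For anyone who wants to try something similar, the following is a rough sketch of a simulation along the lines described above; it is not the code that produced the figures quoted. Several details are assumptions because the description doesn't pin them down: the 50/50 split between x = 0 and x = 1, an intercept of 0, a 25% symmetric trimmed mean as the "simple L-estimate", and the function names `simulate_slopes` and `median_abs_error`. It relies on the fact that with a binary x the least squares line passes through the two group means and an L1 line passes through the two group medians, so each slope is just a difference of location estimates.

```python
import numpy as np


def simulate_slopes(n_sims=1000, n_per_group=50, true_slope=0.5, trim=0.25, seed=0):
    """Slope estimates for y = true_slope * x + Cauchy error, with x in {0, 1}."""
    rng = np.random.default_rng(seed)
    # Responses at x = 0 and at x = 1; the intercept is taken to be 0 (assumption).
    y0 = rng.standard_cauchy((n_sims, n_per_group))
    y1 = true_slope + rng.standard_cauchy((n_sims, n_per_group))

    # With binary x, the least squares slope is the difference of group means.
    ls = y1.mean(axis=1) - y0.mean(axis=1)

    # An L1 (least absolute deviations) fit passes through a median of each group.
    l1 = np.median(y1, axis=1) - np.median(y0, axis=1)

    # A simple L-estimate of location: symmetric trimmed mean (assumed choice).
    k = int(trim * n_per_group)

    def trimmed_mean(y):
        # Sort each sample, drop the k smallest and k largest, average the rest.
        return np.sort(y, axis=1)[:, k:n_per_group - k].mean(axis=1)

    le = trimmed_mean(y1) - trimmed_mean(y0)
    return {"LS": ls, "L1": l1, "LE": le}


def median_abs_error(estimates, true_slope=0.5):
    """Median absolute deviation of the estimates from the true slope."""
    return float(np.median(np.abs(estimates - true_slope)))


if __name__ == "__main__":
    slopes = simulate_slopes()
    for name, est in slopes.items():
        worst = float(np.max(np.abs(est - 0.5)))
        print(f"{name}: median abs error {median_abs_error(est):.3f}, "
              f"max abs error {worst:.2f}")
```

The exact numbers will differ from those quoted above, since the seed, the group sizes and the particular L-estimate are all guesses, but the qualitative picture (LS wildly variable, the two robust slopes tightly clustered around the true value) should come through.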

There are dozens of other reasonable estimators that might have been used here; this was simply a quick calculation to illustrate that even the best/most efficient linear estimators may not be useful. An ML estimator of the slope would perform better (in the MSE sense) than the two robust estimators used here, but in practice you'd want something with some robustness to influential points.