Suppose we have a linear model for a dependent variable $y$ in terms of two independent variables $x_1$ and $x_2$, given by $y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \epsilon_i$.

If we were to estimate the parameters $\beta_1$ and $\beta_2$ by ML, we would have to specify a distribution for $\epsilon_i$ (assuming $x_1$ and $x_2$ are 'fixed'). Suppose we choose two different density functions $f$ and $g$ for $\epsilon$. Does it make sense to compare the two corresponding maximum likelihood values for the two models to decide on which error distribution is more appropriate?

My intuition would tell me that this is not a correct approach because likelihood values are not absolutely comparable if we go from one distribution to another.

#### Best Answer

To get a sense of the problem, contemplate that density functions used to define likelihood functions are defined *with respect to* some dominating measure. So if we change the dominating measure, the likelihood function will change.

In more detail (but informally), let the statistical model be given as a family of probability measures $P(\cdot;\theta)$ indexed by the parameter $\theta$. We must assume that all these measures are absolutely continuous with respect to some dominating measure $\mu$. Then we can write $$ P(A;\theta) = \int_A f(x;\theta)\, \mu(dx) $$ where $f(\cdot;\theta)$ is the Radon-Nikodym derivative of $P(\cdot;\theta)$ with respect to $\mu$. But the dominating measure $\mu$ is not unique. Suppose we instead define densities with respect to some other dominating measure $\lambda$, equivalent to $\mu$ (meaning that they have the same null sets). The likelihood function defined with respect to $\mu$ is $$ f(x;\theta) $$ (viewed as a function of $\theta$ for given $x$). The likelihood function with respect to $\lambda$ becomes $$ f(x;\theta) \frac{d\mu}{d\lambda}(x) $$ where $\frac{d\mu}{d\lambda}$ is the Radon-Nikodym derivative of $\mu$ with respect to $\lambda$.

So by changing the dominating measure we can get many different versions of the likelihood function, but they will all be proportional (as functions of $\theta$), since the factor $\frac{d\mu}{d\lambda}(x)$ does not depend on $\theta$. See also *What does "likelihood is only defined up to a multiplicative constant of proportionality" mean in practice?*
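As a small worked illustration (my own choice of example, not part of the general argument), take a single observation $x$ from $N(\theta, 1)$, and write $\varphi$ for the standard normal density. With respect to Lebesgue measure $\mu$ the likelihood is $$ L_\mu(\theta) = \varphi(x-\theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}(x-\theta)^2\right). $$ If we instead take $\lambda$ to be the standard normal probability measure, then $\frac{d\mu}{d\lambda}(x) = 1/\varphi(x)$ and the likelihood becomes $$ L_\lambda(\theta) = \frac{\varphi(x-\theta)}{\varphi(x)} = \exp\left(\theta x - \tfrac{1}{2}\theta^2\right). $$ The two versions differ by the factor $1/\varphi(x) = \sqrt{2\pi}\, e^{x^2/2}$, which is free of $\theta$, so both are maximized at $\hat\theta = x$. Their numerical values are different, though, which is why a likelihood value computed under one dominating measure says nothing when set against a value computed under another.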

One consequence of this is that to be able to compare likelihoods (and hence AIC) for different models, the likelihoods must be defined with respect to *the same dominating measure*. This also implies that they must be defined for exactly the same data. Sometimes one uses continuous models as approximations for discrete data. If one contemplates both continuous and discrete models, these two kinds of models **cannot** be compared with AIC, since they use different dominating measures (Lebesgue measure and counting measure, respectively).

A point raised in one comment is about nested models. Some theoreticians hold that AIC can only be used to compare nested models. Others disagree. But if you want to use AIC to compare non-nested model classes, you have to be careful. AIC as implemented in R, for instance, is based on likelihoods where "irrelevant constants" are neglected. This has the effect of making these AIC values non-comparable across model classes! So, if you still want to do it, you must program the AIC calculations yourself.
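To make "program the AIC calculations yourself" concrete, here is a minimal sketch in Python (standard library only) for the two-regressor model in the question, comparing normal and Laplace errors. Everything in it is my own assumption for illustration: the simulated data, the function names (`fit_wls`, `fit_lad`, `loglik_normal`, `loglik_laplace`, `aic`), and the use of iteratively reweighted least squares as an approximate way to get the Laplace ML (least absolute deviations) fit. The point is only that both log-likelihoods are full Lebesgue densities with every constant kept.

```python
import math
import random


def fit_wls(x1, x2, y, w=None):
    """Weighted least squares for y = x1*b1 + x2*b2 (no intercept, as in the question)."""
    w = w or [1.0] * len(y)
    s11 = sum(wi * a * a for wi, a in zip(w, x1))
    s12 = sum(wi * a * b for wi, a, b in zip(w, x1, x2))
    s22 = sum(wi * b * b for wi, b in zip(w, x2))
    t1 = sum(wi * a * v for wi, a, v in zip(w, x1, y))
    t2 = sum(wi * b * v for wi, b, v in zip(w, x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


def residuals(beta, x1, x2, y):
    return [v - beta[0] * a - beta[1] * b for a, b, v in zip(x1, x2, y)]


def fit_lad(x1, x2, y, iters=100, floor=1e-6):
    """Approximate least-absolute-deviations fit (Laplace MLE of beta) via IRLS."""
    beta = fit_wls(x1, x2, y)  # start from the least-squares fit
    best, best_l1 = beta, sum(abs(r) for r in residuals(beta, x1, x2, y))
    for _ in range(iters):
        w = [1.0 / max(abs(r), floor) for r in residuals(beta, x1, x2, y)]
        beta = fit_wls(x1, x2, y, w)
        l1 = sum(abs(r) for r in residuals(beta, x1, x2, y))
        if l1 < best_l1:  # keep the best iterate seen so far
            best, best_l1 = beta, l1
    return best


def loglik_normal(res, full=True):
    """Maximized normal log-likelihood; full=False drops the additive constants."""
    n = len(res)
    sigma2 = sum(r * r for r in res) / n  # ML estimate of the error variance
    const = -0.5 * n * (math.log(2 * math.pi) + 1) if full else 0.0
    return -0.5 * n * math.log(sigma2) + const


def loglik_laplace(res, full=True):
    """Maximized Laplace log-likelihood; full=False drops the additive constants."""
    n = len(res)
    b = sum(abs(r) for r in res) / n  # ML estimate of the Laplace scale
    const = -n * (math.log(2) + 1) if full else 0.0
    return -n * math.log(b) + const


def aic(loglik, k=3):
    """AIC with k free parameters (here: beta_1, beta_2 and one scale)."""
    return 2 * k - 2 * loglik


# Simulated data (an assumption for this sketch): Laplace errors, true beta = (1, -2).
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
eps = [random.choice((-1, 1)) * random.expovariate(1.0) for _ in range(n)]
y = [1.0 * a - 2.0 * b + e for a, b, e in zip(x1, x2, eps)]

res_normal = residuals(fit_wls(x1, x2, y), x1, x2, y)
res_laplace = residuals(fit_lad(x1, x2, y), x1, x2, y)

for full in (True, False):
    label = "full densities" if full else "constants dropped"
    print(label,
          "AIC normal:", aic(loglik_normal(res_normal, full)),
          "AIC Laplace:", aic(loglik_laplace(res_laplace, full)))
```

With the constants kept, the two AIC values refer to the same dominating measure (Lebesgue) and the same data, so they can be set side by side. With `full=False`, the normal-minus-Laplace gap in log-likelihood is shifted by $n(\log 2 + \tfrac12 - \tfrac12\log 2\pi) \approx 0.274\,n$, i.e. about $0.55\,n$ on the AIC scale, which can easily be enough to reverse the ranking.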