Many theorems in mathematical statistics hold only under regularity conditions.
For example, there is a theorem that states:
"Let $hat{theta}_n$ be the maximum likelihood estimator of $theta$ based on the sample $X_n = (X_1,…,X_n)$, then under regularity conditions, $hat{theta}_n$ is a consistent estimator of $theta$.
Three of the regularity conditions I typically see are:
1) The set $\mathbb{X} = \{ x_1 \in \mathbb{R} : L_{x_1}(\theta) > 0 \}$ doesn't depend on $\theta$.
2) If $L_{x_1}(\theta_1) = L_{x_1}(\theta_2)$ for almost all $x_1 \in \mathbb{X}$, then $\theta_1 = \theta_2$.
3) The parameter space $\phi$ of the unknown parameter $\theta$ is an open subset (though not necessarily a proper subset) of the real line.
I do not understand 2) and 3). Would anyone be able to give me an intuitive explanation of number 3) and perhaps why we need 2) to hold?
Best Answer
Hm, you need much more than those three conditions for MLE to be consistent.
The log-likelihood function $L(X, \theta)$ (if I am interpreting your notation correctly) is a random function of the data $X$. Your condition 2) says that the family of all possible realizations of this random function $\{ L(x, \theta) \}_{x \in \mathbb{X}}$ separates points in your parameter space $\Theta$. In other words, the log-likelihood function can tell the different parameters apart. So identification of parameters via MLE at least makes sense.
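As a toy illustration of what goes wrong when 2) fails (my own example, not from the question): suppose $X_1 \sim N(\theta^2, 1)$. Then for every realization $x_1$,
$$L_{x_1}(\theta) = \log\frac{1}{\sqrt{2\pi}} - \tfrac{1}{2}\left(x_1 - \theta^2\right)^2 = L_{x_1}(-\theta),$$
so the likelihood cannot separate $\theta$ from $-\theta$, and no estimator, MLE or otherwise, can consistently recover the sign of $\theta$.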
So how does MLE go? You take a realization of $L(X, \theta)$, find the argmax over $\theta$, and that is your estimate. Just for the estimation part, it's good to have $L(X, \theta)$ differentiable in $\theta$. That's where 3) comes in: you can only talk about differentiability on open sets.
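To make the argmax step concrete, here is a minimal numerical sketch (my own toy model, not from the question: an Exponential sample with rate $\theta$, chosen because the MLE has the closed form $n / \sum_i X_i$ to check against):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta0 = 2.0                                     # true rate parameter
x = rng.exponential(scale=1 / theta0, size=500)

def neg_log_lik(theta):
    # Exponential(rate=theta) log-likelihood: n*log(theta) - theta * sum(x)
    return -(len(x) * np.log(theta) - theta * x.sum())

# The parameter space (0, inf) is open; the optimizer searches its interior,
# which is exactly where derivatives of the log-likelihood make sense.
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, len(x) / x.sum())   # numerical argmax vs. closed-form MLE
```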
This is just dotting the i's and crossing the t's so far; none of these brings MLE even close to consistency. To get consistency, you need the log-likelihood $L_n(X, \theta)$ (now written with the sample size $n$ explicit), suitably normalized, to converge uniformly over $\Theta$, in probability, to a non-random function that is maximized at the true parameter $\theta_0$.
If you divide the log-likelihood by the sample size $n$, you see that this non-random function is, up to sign and an additive constant, the Kullback-Leibler divergence between the true density and the candidate density.
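Spelling that step out (the standard argument, with $f_\theta$ the density of a single observation and $\theta_0$ the truth): by the law of large numbers,
$$\frac{1}{n}\sum_{i=1}^{n} \log f_\theta(X_i) \;\xrightarrow{\;p\;}\; \mathbb{E}_{\theta_0}\!\left[\log f_\theta(X_1)\right] = \mathbb{E}_{\theta_0}\!\left[\log f_{\theta_0}(X_1)\right] - \mathrm{KL}\!\left(f_{\theta_0} \,\|\, f_\theta\right).$$
Since $\mathrm{KL} \ge 0$, with equality iff $f_\theta = f_{\theta_0}$ almost everywhere (which, by condition 2), forces $\theta = \theta_0$), the limit is maximized exactly at $\theta_0$.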
To get uniformly close to this limit, you need Lebesgue's dominated convergence theorem to be applicable to $L(x, \theta)$ on a small neighborhood $N_\theta$ of each $\theta$. Roughly speaking, this says that if $L(x, \theta)$ doesn't change too much on $N_\theta$ for each $x$, then you can uniformly bound the change of $L(x, \theta)$ over a set of $x$'s with probability close to $1$. This is the most crucial condition.
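A quick numerical illustration of that uniform control (same toy Exponential model as above, my own example; the sup-norm gap over a compact grid between the averaged log-likelihood and its limit $\log\theta - \theta/\theta_0$ shrinks as $n$ grows):

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0
grid = np.linspace(0.5, 5.0, 200)       # compact grid inside the parameter space

def avg_log_lik(x, theta):
    # (1/n) * log-likelihood of Exponential(rate=theta); vectorized over theta
    return np.log(theta) - theta * x.mean()

limit = np.log(grid) - grid / theta0    # E_{theta0}[log f_theta(X)]

for n in (100, 1_000, 10_000):
    x = rng.exponential(scale=1 / theta0, size=n)
    gap = np.abs(avg_log_lik(x, grid) - limit).max()
    print(n, gap)                       # sup over the grid shrinks with n
```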