Solved – Akaike Information Criterion (AIC) derivation

I am trying to understand the derivation of the Akaike Information Criterion (AIC), and this resource explains it quite well, although a few points remain mysterious to me.

First of all, it takes $\hat{\theta}$ to be the parameters resulting from Maximum Likelihood Estimation (MLE), and it says the discrepancy from the true model can be measured with the Kullback-Leibler distance:

$$\int p(y) \log p(y)\, dy - \int p(y) \log \hat{p}_j(y)\, dy$$

Minimising this distance is equivalent to maximising the second term, referred to as $K$.
One trivial estimate of $K$ is

$$\bar{K} = \frac{1}{N} \sum_{i=1}^N \log p(Y_i, \hat{\theta}) = \frac{\ell_j(\hat{\theta})}{N}$$
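For concreteness, here is a minimal Python sketch of this naive estimate (my own; the Gaussian model, sample size, and seed are illustrative choices, not taken from the resource):

```python
import numpy as np

rng = np.random.default_rng(0)

# True data-generating density p(y): standard normal (illustrative choice).
N = 10_000
y = rng.normal(loc=0.0, scale=1.0, size=N)

# Candidate model: Gaussian with both parameters fitted by MLE.
mu_hat = y.mean()            # MLE of the mean
sigma_hat = y.std(ddof=0)    # MLE of the standard deviation

# Naive estimate of K: the average log-likelihood at the MLE, i.e. ell(theta_hat) / N.
log_lik = -0.5 * np.log(2 * np.pi * sigma_hat**2) - (y - mu_hat)**2 / (2 * sigma_hat**2)
K_bar = log_lik.mean()

# Since the truth lies in the fitted family here, K_bar should be close to
# E[log p(Y)] = -0.5 * (1 + log(2*pi)) ≈ -1.419.
print(K_bar)
```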

Suppose $\theta_0$ maximises $K$ (equivalently, minimises the KL distance) and let

$$s(y,\theta) = \frac{\partial \log p(y, \theta)}{\partial \theta}$$

be the score and $H(y,\theta)$ the matrix of second derivatives.

  1. Later in the proof the author uses the fact that the score has mean $0$: on what grounds?

Then it says: let $$Z_n = \sqrt{n}\,(\hat{\theta} - \theta_0)$$

and recall that $$Z_n \rightarrow \mathcal{N}(0, J^{-1} V J^{-1})$$

where
$$J = -E[H(Y,\theta_0)]$$

and
$$V = \operatorname{Var}(s(Y, \theta_0)).$$

  2. Why $$Z_n = \sqrt{n}\,(\hat{\theta} - \theta_0)$$? Where does it come from?

Then let

$$S_n = \frac{1}{n} \sum_{i=1}^n s(Y_i, \theta_0)$$

It says that, by the Central Limit Theorem,
$$\sqrt{n}\,S_n \rightarrow \mathcal{N}(0,V)$$

  3. $V$ comes from the definition, but why is the mean $0$? Where does that come from?
  4. At some point it says:
    $$J_n = -\frac{1}{n}\sum_{i=1}^n H(Y_i, \theta_0) \xrightarrow{P} J$$
    What is the meaning of $\xrightarrow{P} J$? (A small simulation of both limits is sketched right after this list.)
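As a quick sanity check on the two limits above, here is a small simulation sketch (my own, in Python; the Poisson model and all numerical values are illustrative choices, not from the resource):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: Poisson(theta), log p(y; theta) = y*log(theta) - theta - log(y!)
# Score:   s(y, theta) = y/theta - 1
# Hessian: H(y, theta) = -y/theta**2
# Hence J = -E[H] = 1/theta0 and V = Var(s) = Var(Y)/theta0**2 = 1/theta0.
theta0 = 3.0
n = 5_000
reps = 2_000

root_n_Sn = np.empty(reps)
Jn = np.empty(reps)
for r in range(reps):
    y = rng.poisson(lam=theta0, size=n)
    Sn = np.mean(y / theta0 - 1.0)            # S_n: averaged score at theta0
    root_n_Sn[r] = np.sqrt(n) * Sn
    Jn[r] = np.mean(y / theta0**2)            # J_n = -(1/n) * sum of H(Y_i, theta0)

print("mean of sqrt(n)*S_n:", root_n_Sn.mean())        # close to 0
print("var  of sqrt(n)*S_n:", root_n_Sn.var())         # close to V = 1/theta0 ≈ 0.333
print("spread of J_n around J:", Jn.mean(), Jn.std())  # J_n concentrates near J = 1/theta0
```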

EDIT

Additional question.
Defining
$$K_0 = \int p(y) \log p(y, \theta_0)\, dy$$

and
$$A_N = \frac{1}{N} \sum_{i=1}^N \big(\ell(Y_i,\theta_0) - K_0\big),$$
why is $$E[A_N] = 0$$?
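For what it's worth, here is a quick numerical check (my own sketch; the standard-normal model, sample size, and replication count are illustrative assumptions, not from the resource) that the average of $A_N$ over many samples is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setting: Y_i ~ N(0, 1) with theta_0 = (0, 1), so
# log p(y, theta_0) = -0.5*log(2*pi) - y**2 / 2 and
# K_0 = E[log p(Y, theta_0)] = -0.5 * (1 + log(2*pi)).
K0 = -0.5 * (1.0 + np.log(2.0 * np.pi))
N = 1_000
reps = 5_000

A_N = np.empty(reps)
for r in range(reps):
    y = rng.normal(size=N)
    log_p = -0.5 * np.log(2.0 * np.pi) - y**2 / 2.0
    A_N[r] = np.mean(log_p - K0)    # A_N for this replication

print("average of A_N over replications:", A_N.mean())   # should be very close to 0
```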

Consider a scalar parameter $\theta_0$ and the corresponding scalar estimate $\hat{\theta}$ for simplicity.

I will answer Q1 and Q3, which are essentially asking why the mean of the score function is zero, i.e. $\Bbb{E}_{\theta}[s(\theta)] = 0$. This is a widely known result. To put it simply, notice that the score function $s(\theta)$ depends on the random observation $X$, so we can take its expectation as follows:

\begin{align} \Bbb{E}_{\theta}[s] &= \int_x f(x;\theta)\, \frac{\partial \log f(x;\theta)}{\partial \theta}\, dx = \int_x \frac{\partial f(x;\theta)}{\partial \theta}\, dx \\ &= \frac{\partial}{\partial \theta} \int_x f(x;\theta)\, dx \qquad \text{(exchanging integral and derivative)} \\ &= \frac{\partial}{\partial \theta}\, 1 = 0 \end{align}

Now notice that $S_n$ is nothing but the average of the score functions of independent observations; hence its expectation is also zero.
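A quick numerical illustration of this zero-mean property (my own sketch; the $N(\theta, 1)$ model and the specific values are illustrative assumptions, not from the question):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative model: X ~ N(theta, 1) with theta = 1.5.
# Score: s(x, theta) = d/dtheta [ -0.5*log(2*pi) - (x - theta)**2 / 2 ] = x - theta.
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)

score = x - theta
print("empirical mean of the score:", score.mean())   # close to 0, since E_theta[s(theta)] = 0
```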

For Q2, the motivation is to study the asymptotic behaviour of our estimator around the true parameter. Let $\hat{\theta}$ be the maximizer of $L_{n}(\theta)=\frac{1}{n} \sum_{i=1}^{n} \log f\left(X_{i} \mid \theta\right)$. Now, by the mean value theorem, \begin{align} 0=L_{n}^{\prime}(\hat{\theta}) &= L_{n}^{\prime}\left(\theta_{0}\right)+L_{n}^{\prime \prime}\left(\hat{\theta}_{1}\right)\left(\hat{\theta}-\theta_{0}\right) \quad \text{(for some $\hat{\theta}_1$ between $\hat{\theta}$ and $\theta_0$)}\\ \implies \left(\hat{\theta}-\theta_{0}\right) &= -\frac{L_{n}^{\prime}\left(\theta_{0}\right)}{L_{n}^{\prime \prime}\left(\hat{\theta}_{1}\right)} \end{align}

Consider the numerator: \begin{align} \sqrt{n}\left(\frac{1}{n} \sum_{i=1}^{n} \ell^{\prime}\left(X_{i} \mid \theta_{0}\right)-\mathbb{E}_{\theta_{0}}\!\left[\ell^{\prime}\left(X_{1} \mid \theta_{0}\right)\right]\right) &= \sqrt{n}\,(S_n - \Bbb{E}[S_n]) \\ &\rightarrow \mathcal{N}\!\left(0, \operatorname{Var}_{\theta_{0}}\!\left(\ell^{\prime}\left(X_{1} \mid \theta_{0}\right)\right)\right) = \mathcal{N}(0,V) \end{align}

Now, the denominator $L_{n}^{\prime\prime}(\hat{\theta}_1)$ converges to $\mathbb{E}[H(Y,\theta_0)] = -J$, i.e. minus the Fisher information, by the LLN. Therefore, for the scalar-parameter case, we can see that $$\sqrt{n}\,(\hat{\theta} - \theta_0) \rightarrow \mathcal{N}\!\left(0,\frac{V}{J^2}\right)$$
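To see this limit in action for a concrete scalar model, here is a short simulation sketch (my own illustrative choice: a Poisson($\theta$) model, for which the MLE is the sample mean and $J = V = 1/\theta_0$, so $V/J^2 = \theta_0$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative scalar model: Y_i ~ Poisson(theta0); the MLE is the sample mean.
# Here J = V = 1/theta0, so the limiting variance V/J**2 equals theta0.
theta0 = 3.0
n = 2_000
reps = 5_000

Zn = np.empty(reps)
for r in range(reps):
    y = rng.poisson(lam=theta0, size=n)
    theta_hat = y.mean()                       # MLE of the Poisson mean
    Zn[r] = np.sqrt(n) * (theta_hat - theta0)  # Z_n = sqrt(n) * (theta_hat - theta0)

print("empirical mean of Z_n:", Zn.mean())     # close to 0
print("empirical var  of Z_n:", Zn.var())      # close to V/J**2 = theta0 = 3
```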
