# Solved – Taylor series expansion of maximum likelihood estimator, Newton-Raphson, Fisher scoring and distribution of MLE by Delta method

Assume $\ell\left(\theta\right)$ is the log-likelihood of the parameter vector $\theta$ and $\widehat{\theta}$ is the maximum likelihood estimator of $\theta$. Then the Taylor series of $\ell\left(\theta\right)$ about $\widehat{\theta}$ is
\begin{align*}
\ell\left(\theta\right) & \approx\ell\left(\widehat{\theta}\right)+\frac{\partial\ell\left(\theta\right)}{\partial\theta}\Bigr|_{\theta=\widehat{\theta}}\left(\theta-\widehat{\theta}\right)+\frac{1}{2}\left(\theta-\widehat{\theta}\right)^{\prime}\frac{\partial^{2}\ell\left(\theta\right)}{\partial\theta\partial\theta^{\prime}}\Bigr|_{\theta=\widehat{\theta}}\left(\theta-\widehat{\theta}\right)\\
\theta-\widehat{\theta} & =-\left[\frac{\partial^{2}\ell\left(\theta\right)}{\partial\theta\partial\theta^{\prime}}\Bigr|_{\theta=\widehat{\theta}}\right]^{-}\left[\frac{\partial\ell\left(\theta\right)}{\partial\theta}\Bigr|_{\theta=\widehat{\theta}}\right]\\
\widehat{\theta}-\theta & =\left[\frac{\partial^{2}\ell\left(\theta\right)}{\partial\theta\partial\theta^{\prime}}\Bigr|_{\theta=\widehat{\theta}}\right]^{-}\left[\frac{\partial\ell\left(\theta\right)}{\partial\theta}\Bigr|_{\theta=\widehat{\theta}}\right]\\
\widehat{\theta}-\theta & =\left[\mathbb{H}\left(\theta\right)\Bigr|_{\theta=\widehat{\theta}}\right]^{-}\left[\mathbb{S}\left(\theta\right)\Bigr|_{\theta=\widehat{\theta}}\right]
\end{align*}

As

$\theta=\widehat{\theta}-\left[\mathbb{H}\left(\theta\right)\Bigr|_{\theta=\widehat{\theta}}\right]^{-}\left[\mathbb{S}\left(\theta\right)\Bigr|_{\theta=\widehat{\theta}}\right]$

So

\begin{align*}
\theta^{\left(m+1\right)} & =\theta^{\left(m\right)}-\left[\mathbb{E}\left\{ \mathbb{H}\left(\theta^{\left(m\right)}\right)\right\} \right]^{-}\mathbb{S}\left(\theta^{\left(m\right)}\right)
\end{align*}
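As an illustration of this Fisher-scoring update (my own sketch, not part of the question), consider the Cauchy location model with unit scale, where the expected Hessian has the simple closed form $\mathbb{E}\left\{ \mathbb{H}\left(\theta\right)\right\} =-n/2$, so the update is $\theta^{\left(m+1\right)}=\theta^{\left(m\right)}+\frac{2}{n}\mathbb{S}\left(\theta^{\left(m\right)}\right)$. The data and starting value below are made up for the example.

```python
import math

def score(theta, xs):
    # S(theta): gradient of the Cauchy location log-likelihood
    # ell(theta) = -sum_i log(1 + (x_i - theta)^2) + const
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in xs)

def fisher_scoring(xs, theta0, tol=1e-10, max_iter=200):
    # Fisher scoring replaces the observed Hessian by its expectation,
    # E[H(theta)] = -n/2 for this model, giving the update
    #   theta_{m+1} = theta_m + (2/n) * S(theta_m)
    theta = theta0
    n = len(xs)
    for _ in range(max_iter):
        step = (2.0 / n) * score(theta, xs)
        theta += step
        if abs(step) < tol:
            break
    return theta

xs = [-1.2, 0.3, 0.8, 1.5, 2.1, 0.1, -0.4]   # made-up data
theta_hat = fisher_scoring(xs, theta0=0.5)
```

At convergence the score is (numerically) zero, i.e. `theta_hat` is a stationary point of the log-likelihood.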

Questions

1. I'm not sure whether my derivation is correct. I have seen at least two different versions of the derivation.
2. What would be the mean, variance and distribution of $\widehat{\theta}$ (perhaps via the Delta method)?

I will denote by $\hat \theta$ the maximum likelihood estimator, while $\theta^{\left(m+1\right)}$ and $\theta^{\left(m\right)}$ are any two vectors. $\theta_0$ will denote the true value of the parameter vector. I am suppressing the appearance of the data.

The (untruncated) 2nd-order Taylor expansion of the log-likelihood viewed as a function of $\theta^{\left(m+1\right)}$, $\ell\left(\theta^{\left(m+1\right)}\right)$, centered at $\theta^{\left(m\right)}$, is (in a bit more compact notation than the one used by the OP)

\begin{align}
\ell\left(\theta^{\left(m+1\right)}\right) =& \ell\left(\theta^{\left(m\right)}\right)+\frac{\partial\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)\\
+&\frac{1}{2}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)^{\prime}\frac{\partial^{2}\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta\partial\theta^{\prime}}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right)\\
+&R_2\left(\theta^{\left(m+1\right)}\right)
\end{align}

The derivative of the log-likelihood is (using the properties of matrix differentiation)

$$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right) = \frac{\partial\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta} +\frac{\partial^{2}\ell\left(\theta^{\left(m\right)}\right)}{\partial\theta\partial\theta^{\prime}}\left(\theta^{\left(m+1\right)}-\theta^{\left(m\right)}\right) +\frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)$$

Assume that we require that $$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right)- \frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)=\mathbf 0$$

Then we obtain $$\theta^{\left(m+1\right)}=\theta^{\left(m\right)}-\left[\mathbb{H}\left(\theta^{\left(m\right)}\right)\right]^{-1}\left[\mathbb{S}\left(\theta^{\left(m\right)}\right)\right]$$

This last formula shows how the candidate $\theta$ vector is updated at each step of the algorithm, and it also shows how the updating rule was obtained: $\theta^{\left(m+1\right)}$ must be chosen so that its total marginal effect on the log-likelihood equals its marginal effect on the Taylor remainder. In this way we "contain" how much the derivative of the log-likelihood strays away from zero.
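To make the updating rule concrete, here is a small Python sketch (my own illustrative example, not from the original post) applying it to the exponential model with log-likelihood $\ell\left(\lambda\right)=n\log\lambda-\lambda\sum_i x_i$, whose MLE has the closed form $\hat\lambda = 1/\bar x$:

```python
def newton_update(lam, xs):
    # One step of  theta_{m+1} = theta_m - H(theta_m)^{-1} S(theta_m)
    n = len(xs)
    s = n / lam - sum(xs)        # score   S(lambda) = n/lambda - sum(x_i)
    h = -n / lam ** 2            # Hessian H(lambda) = -n/lambda^2
    return lam - s / h

xs = [0.5, 1.2, 0.7, 2.0, 0.9]   # made-up data
lam = 1.0                        # starting value
for _ in range(50):
    lam = newton_update(lam, xs)

lam_hat = len(xs) / sum(xs)      # closed-form MLE, 1/xbar
```

The iterates converge to `lam_hat`, and applying `newton_update` at `lam_hat` itself returns `lam_hat` unchanged, since the score there is zero.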

If (and when) it so happens that $\theta^{\left(m\right)} = \hat \theta$, we will obtain

$$\theta^{\left(m+1\right)}=\hat \theta-\left[\mathbb{H}\left(\hat \theta\right)\right]^{-1}\left[\mathbb{S}\left(\hat \theta\right)\right]= \hat \theta-\left[\mathbb{H}\left(\hat \theta\right)\right]^{-1}\cdot \mathbf 0 = \hat \theta$$

since by construction $\hat \theta$ makes the gradient of the log-likelihood zero. This tells us that once we "hit" $\hat \theta$, we go nowhere else after that, which, in an intuitive way, validates our decision to essentially ignore the remainder when calculating $\theta^{\left(m+1\right)}$. If the conditions for quadratic convergence of the algorithm are met, we essentially have a contraction mapping, and the MLE is the (or a) fixed point of it. Note that if $\theta^{\left(m\right)} = \hat \theta$, then the remainder also becomes zero and we have $$\frac{\partial}{\partial \theta^{\left(m+1\right)}}\ell\left(\theta^{\left(m+1\right)}\right)- \frac{\partial}{\partial \theta^{\left(m+1\right)}}R_2\left(\theta^{\left(m+1\right)}\right)=\frac{\partial}{\partial \theta}\ell\left(\hat \theta\right)=\mathbf 0$$

So our method is internally consistent.

DISTRIBUTION OF $\hat \theta$

To obtain the asymptotic distribution of the MLE we apply the Mean Value Theorem, according to which, if the log-likelihood is continuous and differentiable, then

$$\frac{\partial}{\partial \theta}\ell\left(\hat \theta\right) = \frac{\partial\ell\left(\theta_0\right)}{\partial\theta} +\frac{\partial^{2}\ell\left(\bar \theta\right)}{\partial\theta\partial\theta^{\prime}}\left(\hat \theta-\theta_0\right) = \mathbf 0$$

where $\bar \theta$ is a mean value between $\hat \theta$ and $\theta_0$ (strictly speaking, the theorem applies row by row, with a possibly different mean value for each row of the Hessian). Then

$$\left(\hat \theta-\theta_0\right) = -\left[\mathbb{H}\left(\bar \theta\right)\right]^{-1}\left[\mathbb{S}\left( \theta_0\right)\right]$$

$$\Rightarrow \sqrt n\left(\hat \theta-\theta_0\right) = -\left[\frac 1n\mathbb{H}\left(\bar \theta\right)\right]^{-1}\left[\frac 1{\sqrt n}\mathbb{S}\left( \theta_0\right)\right]$$

Under the appropriate assumptions, the MLE is a consistent estimator. Then so is $\bar \theta$, since it is sandwiched between the MLE and the true value. Under the assumption that our data are stationary, and one more technical condition (a local dominance condition that guarantees that the expected value of the supremum of the Hessian in a neighborhood of the true value is finite), we have $$\frac 1n\mathbb{H}\left(\bar \theta\right) \rightarrow_p E\left[\mathbb{H}\left(\theta_0\right)\right]$$

Moreover, if interchange of integration and differentiation is valid (which it usually will be), then $$E\left[\mathbb{S}\left( \theta_0\right)\right]=\mathbf 0$$ This, together with the assumption that our data are i.i.d., permits us to use the Lindeberg–Lévy CLT and conclude that $$\left[\frac 1{\sqrt n}\mathbb{S}\left( \theta_0\right)\right] \rightarrow_d N(\mathbf 0, \Sigma),\qquad \Sigma = E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]$$

and then, by applying Slutsky's Theorem, that $$\sqrt n\left(\hat \theta-\theta_0\right) \rightarrow_d N\left(\mathbf 0, \operatorname{Avar}\right)$$

with

$$\operatorname{Avar} = \Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1}\cdot \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)\cdot \Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1}$$
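A quick simulation (my own sketch; the model, true value, and sample sizes are arbitrary choices) can be used to check this asymptotic result. For the exponential model with rate $\lambda_0$, $\operatorname{Avar}=\lambda_0^2$, so the standard deviation of $\hat\lambda$ over repeated samples should be close to $\lambda_0/\sqrt n$:

```python
import math
import random

random.seed(0)
lam0, n, reps = 2.0, 200, 2000

estimates = []
for _ in range(reps):
    xs = [random.expovariate(lam0) for _ in range(n)]
    estimates.append(n / sum(xs))          # MLE for the exponential rate

mean = sum(estimates) / reps
emp_sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / reps)
theory_sd = lam0 / math.sqrt(n)            # sqrt(Avar / n) with Avar = lam0^2
```

With these settings the empirical standard deviation lands close to the asymptotic prediction, and the mean of the estimates is close to $\lambda_0$.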

But the information matrix equality states that

$$-\Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big) = \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)$$

and so $$\operatorname{Avar} = -\Big(E\left[\mathbb{H}\left(\theta_0\right)\right]\Big)^{-1} = \Big(E\left[\mathbb{S}\left( \theta_0\right)\mathbb{S}\left( \theta_0\right)'\right]\Big)^{-1}$$
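As a worked example (mine, not from the original answer), take i.i.d. Bernoulli$(p)$ data, where everything is available in closed form. Writing the per-observation score and Hessian,

$$\ell\left(p\right)=\sum_{i=1}^n\left[x_i\log p+(1-x_i)\log(1-p)\right],\qquad \mathbb{S}_i\left(p\right)=\frac{x_i-p}{p(1-p)},\qquad \mathbb{H}_i\left(p\right)=-\frac{x_i}{p^2}-\frac{1-x_i}{(1-p)^2}$$

we get, at the true value $p_0$,

$$E\left[\mathbb{H}_i\left(p_0\right)\right]=-\frac{1}{p_0(1-p_0)},\qquad E\left[\mathbb{S}_i\left(p_0\right)^2\right]=\frac{\operatorname{Var}\left(x_i\right)}{p_0^2(1-p_0)^2}=\frac{1}{p_0(1-p_0)}$$

so the information matrix equality holds, $\operatorname{Avar}=p_0(1-p_0)$, and we recover the familiar approximation $\hat p \sim_{approx} N\left(p_0,\ p_0(1-p_0)/n\right)$.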

Then for large samples the distribution of $\hat \theta$ is approximated by

$$\hat \theta \sim_{approx} N\left(\theta_0, \frac 1n\operatorname {\widehat{Avar}}\right)$$

where $\operatorname {\widehat{Avar}}$ is a consistent estimator of $\operatorname{Avar}$ (the sample analogues of the expected values involved are such consistent estimators).
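As an illustration of such sample analogues (my own sketch, using the exponential model again with made-up data), both the inverse-Hessian and the outer-product-of-scores estimators of $\operatorname{Avar}$ can be computed directly:

```python
import math

xs = [0.5, 1.2, 0.7, 2.0, 0.9, 1.4, 0.3, 1.1]   # made-up data
n = len(xs)
lam_hat = n / sum(xs)                            # MLE: 1/xbar

# Per-observation score and Hessian for the exponential model:
#   S_i(lam) = 1/lam - x_i,   H_i(lam) = -1/lam^2
H_bar = -1.0 / lam_hat ** 2                      # (1/n) sum_i H_i(lam_hat)
Sigma_hat = sum((1 / lam_hat - x) ** 2 for x in xs) / n

avar_hessian = -1.0 / H_bar                      # analogue of -E[H]^{-1}
avar_opg = 1.0 / Sigma_hat                       # analogue of E[SS']^{-1}
se = math.sqrt(avar_hessian / n)                 # standard error of lam_hat
```

The two estimates differ in finite samples but both are consistent; for this model the inverse-Hessian version reduces to $\hat\lambda^2$ exactly.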
