In the EM algorithm approach we use Jensen's inequality to arrive at $$\log p(x \mid \theta) \geq \int \log p(z, x \mid \theta) \, p(z \mid x, \theta^{(k)}) \, dz - \int \log p(z \mid x, \theta) \, p(z \mid x, \theta^{(k)}) \, dz$$

and define $\theta^{(k+1)}$ by $$\theta^{(k+1)} = \arg\max_{\theta} \int \log p(z, x \mid \theta) \, p(z \mid x, \theta^{(k)}) \, dz$$

Everything I read about EM just plops this down, and I've always felt uneasy not having an explanation of why the algorithm arises naturally. I understand that the $\log$-likelihood is typically used so that sums replace products, but the appearance of $\log$ in the definition of $\theta^{(k+1)}$ feels unmotivated to me. Why should one consider $\log$ and not some other monotonic function? For various reasons I suspect that the "meaning" or "motivation" behind expectation maximization has some kind of explanation in terms of information theory and sufficient statistics. If there were such an explanation, it would be much more satisfying than just an abstract algorithm.


#### Best Answer

**Likelihood vs. log-likelihood**

As has already been said, the $\log$ is introduced in maximum likelihood simply because it is generally easier to optimize sums than products. The reason we don't consider other monotonic functions is that the logarithm is, up to a constant factor (i.e. a choice of base), the **unique** continuous function that turns products into sums.
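There is also a practical side to this: a product of many likelihood terms underflows in floating point, while the corresponding sum of logs stays well-behaved. A minimal sketch (the probability values are made up for illustration):

```python
import math

# Hypothetical i.i.d. likelihood values for 1000 data points; each factor is
# small, so the raw product underflows to exactly 0.0 in double precision.
probs = [1e-4] * 1000

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the likelihood itself is numerically useless here

# The log-likelihood replaces the product with a sum and remains finite.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)  # 1000 * log(1e-4), roughly -9210.34
```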

Another way to motivate the logarithm is the following: instead of maximizing the probability of the data under our model, we could equivalently try to minimize the Kullback-Leibler divergence between the data distribution, $p_\text{data}(x)$, and the model distribution, $p(x \mid \theta)$,

$$D_\text{KL}[p_\text{data}(x) \mid\mid p(x \mid \theta)] = \int p_\text{data}(x) \log \frac{p_\text{data}(x)}{p(x \mid \theta)} \, dx = \text{const} - \int p_\text{data}(x) \log p(x \mid \theta) \, dx.$$

The first term on the right-hand side is constant in the parameters. If we have $N$ samples from the data distribution (our data points), we can approximate the second term with the average log-likelihood of the data,

$$\int p_\text{data}(x) \log p(x \mid \theta) \, dx \approx \frac{1}{N} \sum_n \log p(x_n \mid \theta).$$
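This Monte Carlo approximation is easy to check numerically. In the toy setup below (my own choice, not from the answer above) the data distribution is $N(0, 1)$ and the model is $N(\mu, 1)$, for which the integral has the closed form $-\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(1 + \mu^2)$:

```python
import math
import random

random.seed(0)

mu = 0.5  # arbitrary model parameter for this illustration

def log_p(x, mu):
    """Log-density of the model N(mu, 1) at x."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

# Draw N samples from the data distribution N(0, 1) and average the
# log-likelihood, as in the approximation above.
N = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
avg_log_lik = sum(log_p(x, mu) for x in samples) / N

# Closed-form value of the integral for this Gaussian pair.
exact = -0.5 * math.log(2 * math.pi) - 0.5 * (1 + mu ** 2)
print(avg_log_lik, exact)  # the two values should be close
```

As $N$ grows, maximizing the sample average over $\mu$ becomes equivalent to minimizing the KL divergence.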

**An alternative view of EM**

I am not sure this is going to be the kind of explanation you are looking for, but I found the following view of expectation maximization much more enlightening than its motivation via Jensen's inequality (you can find a detailed description in Neal & Hinton (1998) or in Chris Bishop's PRML book, Chapter 9.3).

It is not difficult to show that

$$\log p(x \mid \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz + D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)]$$

for any $q(z \mid x)$. If we call the first term on the right-hand side $F(q, \theta)$, this implies that

$$F(q, \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz = \log p(x \mid \theta) - D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)].$$

Because the KL divergence is always non-negative, $F(q, \theta)$ is a lower bound on the log-likelihood for every fixed $q$. Now, EM can be viewed as alternately maximizing $F$ with respect to $q$ and $\theta$. In particular, by setting $q(z \mid x) = p(z \mid x, \theta)$ in the E-step, we minimize the KL divergence on the right-hand side and thus maximize $F$.
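The decomposition can be verified directly for a discrete latent variable. In the toy model below (the joint table is my own invention), $z \in \{0, 1\}$ and $x$ is fixed, so the integral becomes a two-term sum; the identity $\log p(x \mid \theta) = F(q) + D_\text{KL}[q \mid\mid p(z \mid x)]$ holds for an arbitrary $q$, and the E-step choice $q = p(z \mid x)$ makes the bound tight:

```python
import math

# Joint probabilities p(x, z) for z in {0, 1} at a fixed observation x.
p_joint = [0.2, 0.3]
p_x = sum(p_joint)                      # marginal likelihood p(x) = 0.5
posterior = [p / p_x for p in p_joint]  # p(z | x) = [0.4, 0.6]

def F(q):
    """Lower bound F(q) = sum_z q(z) log( p(x, z) / q(z) )."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, p_joint))

def kl(q, p):
    """KL divergence D_KL[q || p] for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

# The identity log p(x) = F(q) + KL[q || posterior] holds for any q:
q = [0.7, 0.3]
print(F(q) + kl(q, posterior))  # equals log p(x) = log 0.5

# The E-step sets q to the posterior, driving the KL term to zero,
# so the bound is tight: F(posterior) = log p(x).
print(F(posterior))             # also equals log 0.5
```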