I am trying to get an intuitive understanding of the convergence properties of the EM algorithm. I wrote code that performs the following experiment. We are given three coins, $H$, $A$ and $B$, with head probabilities $\lambda$ for $H$, $p_A$ for $A$ and $p_B$ for $B$. I flip coin $H$ $N$ times; each time it shows heads, I flip coin $A$ $M$ times, and each time it shows tails, I flip coin $B$ $M$ times. We thus end up with an $N \times M$ data matrix. In most cases I set $N = 1000$ and $M = 20$.
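The data-generation step can be sketched in numpy as follows (the parameter values are illustrative placeholders, not the ones used in the question):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameter values; substitute whatever you use for generation.
lam, p_A, p_B = 0.6, 0.8, 0.3
N, M = 1000, 20

z = rng.random(N) < lam                        # N flips of coin H (True = heads)
p = np.where(z, p_A, p_B)                      # which coin is flipped in each row
X = rng.binomial(1, p[:, None], size=(N, M))   # the N x M matrix of 0/1 flips
```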

On this data set, I run the EM algorithm. However, for all parameter settings I have used so far to generate the data, the algorithm needs fewer than $10$ iterations to converge, for various initial values. Note that my convergence criterion is that the absolute change in the log-likelihood after an iteration is less than $0.001$. Is such fast convergence always the case, or is it possible to choose the values for the data-generation step and the initial values of the algorithm such that there is no convergence, or a very long time to converge? Also, is it possible to derive a formula in which the convergence time depends on $\lambda$, $p_A$, $p_B$ and the initial values of the algorithm?

Thanks for your help!

I found the described experiment on this website.


#### Best Answer

There are many studies on the speed of convergence, the choice of starting values, and stopping criteria for the EM algorithm; however, I have never seen a formula as specific as the one you are looking for, not even for a simple example like yours.

The speed of convergence in terms of number of iterations (retaining as fixed the model and the algorithm) depends on:

- the shape of the loglikelihood (which depends solely on the observed sufficient statistics)
- the starting values
- the stopping criterion.

Note that the EM is just an algorithm to maximize the loglikelihood, so the speed of convergence is only indirectly affected by the true parameters.

So, yes: in general, different initial values lead to different convergence times. Note that the EM algorithm is slower where the likelihood is flat, so two starting points at the same distance from the maximum can lead to different speeds of convergence depending on the shape.

Starting values on the boundary of the parameter space should be avoided, as they are likely to lead to slow convergence or issues in the algorithm. For example, in your setting, if I write a simple algorithm and choose exactly 0 or 1 as starting values for any parameter I get an error.
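A minimal numpy illustration of why exactly 0 (or, symmetrically, 1) breaks a naive implementation: the E-step evaluates terms like $k \log p$, and $0 \cdot \log 0$ is NaN in floating point, which then propagates through the responsibilities.

```python
import numpy as np

# With a starting value p = 0, the E-step contains terms like k * log(p).
# For a row with k = 0 heads this is 0 * (-inf), which is NaN in floating
# point, so the responsibilities (and everything downstream) become NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    bad = 0 * np.log(0.0)

print(np.isnan(bad))   # True
```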

The EM algorithm can also stop at local maxima. So, different starting values can lead to different "wrong" solutions.

As for the stopping criterion, a small change in the likelihood does not guarantee that the (possibly local) optimum is near. So the algorithm may stop after a few iterations even if the true maximum is still far away.

So, in my opinion, you did not find much difference in the speed of convergence because:

- your example is simple (the loglikelihood is always unimodal), and
- your threshold is too high.

I tried your scenario with a stricter threshold ($10^{-6}$) and experienced greater variability in the number of iterations (from 20 up to 200).

(By the way, the absolute change in the loglikelihood is not a useful criterion, as it depends on the sample size. Usually the relative change $\frac{|l^{(t)} - l^{(t+1)}|}{|l^{(t)}|}$ is preferred.)
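A relative-change stopping rule can be sketched as follows (taking the absolute value in the denominator as well, since the log-likelihood is typically negative):

```python
def converged(ll_old, ll_new, tol=1e-8):
    """Relative-change stopping rule for the log-likelihood."""
    return abs(ll_new - ll_old) / abs(ll_old) < tol
```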

If you want a good starting point, method-of-moments estimates, when available, are the best choice for finite mixture models (according to B. Lindsay), and such an estimator exists for your example: since you essentially have a mixture of two binomial distributions, see "Moment estimators for the parameters of a mixture of two binomial distributions", Blischke (1962).
