Maximum Likelihood Estimation of Dirichlet Mean

Consider the problem of computing a maximum-likelihood estimate of the parameters of a finite Dirichlet distribution, given a set of observations (probability vectors) assumed to have been sampled from a Dirichlet. The following paper provides an iterative fixed-point algorithm that estimates the mean and precision of the Dirichlet separately:

Minka, Thomas. "Estimating a Dirichlet distribution." Technical report (2000).

The algorithm for estimating the mean $\mathbf{m}$, given a fixed precision $s$, is summarized as follows:

1. Estimate the full concentration parameter $\pmb{\alpha} = s\mathbf{m}$ by inverting the digamma function.
2. $\forall\, k$, set $m_k^{new} = \frac{\alpha_k}{\sum_j \alpha_j}$.
3. Repeat until convergence.
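The steps above can be sketched in code. This is my own hedged reading of Minka's mean-given-precision update, not his reference implementation: the stationarity condition is $\psi(s m_k) = \overline{\log p}_k - \lambda/s$ with a Lagrange multiplier enforcing $\sum_k m_k = 1$, and the digamma inverse $\psi^{-1}$ is computed by Newton's method with Minka's piecewise initialization. The function names are mine.

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, n_newton=5):
    """Invert the digamma function psi elementwise via Newton's method,
    using Minka's piecewise initialization."""
    y = np.asarray(y, dtype=float)
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - digamma(1.0)))
    for _ in range(n_newton):
        # Newton step: psi'(x) is the trigamma function polygamma(1, x)
        x = x - (digamma(x) - y) / polygamma(1, x)
    return x

def dirichlet_mean_given_precision(P, s, max_iter=200, tol=1e-12):
    """Fixed-point MLE of the Dirichlet mean m, with precision s held fixed.

    P : (N, K) array whose rows are probability vectors (strictly positive).
    """
    log_p_bar = np.log(P).mean(axis=0)   # the sufficient statistic
    m = P.mean(axis=0)                   # moment estimate as a starting point
    for _ in range(max_iter):
        # Lagrange-multiplier term enforcing sum(m) = 1
        lam = np.dot(m, log_p_bar - digamma(s * m))
        alpha = inv_digamma(log_p_bar - lam)   # candidate for s * m_k
        m_new = alpha / alpha.sum()            # renormalize onto the simplex
        if np.max(np.abs(m_new - m)) < tol:
            return m_new
        m = m_new
    return m
```

Starting from the moment estimate usually means only a handful of fixed-point sweeps are needed in practice.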

Why must we resort to an iterative algorithm to find the mean? Is the average of our observed data vectors not an accurate estimate of the mean? Further, is it also not true that the expected value of the mean of a set of samples from a Dirichlet is the mean of the Dirichlet itself?

Any insight here would be appreciated!


Suppose $\mathbf p_1, \ldots, \mathbf p_N$ are iid $\operatorname{Dirichlet}(s \mathbf m)$. If I'm understanding you correctly, your question is "why use an iterative scheme when $\hat{\mathbf m} = \frac 1 N \sum_{i=1}^N \mathbf p_i$ works?" You are correct that this is a reasonable estimator. But it isn't the maximum likelihood estimator, which is what we care about! The Dirichlet likelihood of a single observation is
$$ L_i(\pmb\alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k p_{ik}^{\alpha_k - 1}, $$
so our goal is to maximize $\prod_i L_i(\pmb\alpha)$ in $\pmb\alpha$; once we do this, we can get the maximum likelihood estimate of $\mathbf m$ by normalizing. But it is easy to see that the likelihood depends on the data only through $\frac 1 N \sum_i \log \mathbf p_i$ rather than $\frac 1 N \sum_i \mathbf p_i$ (where $\log$ is applied elementwise). In some sense, we might think of $\log \mathbf p_i$ as the "appropriate scale" of the data, at least for the Dirichlet distribution, rather than the untransformed $\mathbf p_i$.
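To make the sufficiency point concrete, here is a small sketch (function name is my own) showing that the average Dirichlet log-likelihood touches the data only through the elementwise mean of $\log \mathbf p_i$:

```python
import numpy as np
from scipy.special import gammaln

def avg_dirichlet_loglik(alpha, P):
    """Average log-likelihood of the rows of P under Dirichlet(alpha).

    Note that P enters only via log(P).mean(axis=0): that K-vector is the
    sufficient statistic, so two datasets with the same mean log-vector
    yield identical likelihood surfaces in alpha.
    """
    log_p_bar = np.log(P).mean(axis=0)
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + np.dot(alpha - 1.0, log_p_bar))
```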

So, the MLE is not $\frac 1 N \sum_i \mathbf p_i$ but rather some complicated function of $\frac 1 N \sum_i \log \mathbf p_i$. The question now becomes "why use the MLE rather than the easy estimator?" Well, we have theorems which say the MLE has certain asymptotic optimality properties, so we get a more efficient estimator with the MLE, although $\frac 1 N \sum_i \mathbf p_i$ may still be useful as a starting point for the iterative algorithm. Now, I'm not sure how much the MLE really buys you here, considering that the data must actually be Dirichlet distributed for those guarantees to hold, whereas $\frac 1 N \sum_i \mathbf p_i$ is consistent for the mean no matter what. But that is another story.
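As a quick illustration of that last point, here is a toy sketch (entirely made-up data of my own choosing) where the observations are compositional but not Dirichlet, yet the plain sample mean still estimates $E[\mathbf p]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Logistic-normal data: softmax of correlated Gaussians. This is NOT a
# Dirichlet distribution, so the Dirichlet MLE is misspecified here.
z = rng.multivariate_normal(mean=[0.0, 1.0, 2.0],
                            cov=[[1.0, 0.8, 0.0],
                                 [0.8, 1.0, 0.0],
                                 [0.0, 0.0, 1.0]],
                            size=50_000)
P = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# The sample mean converges to E[p] by the law of large numbers,
# with no distributional assumptions at all.
m_hat = P.mean(axis=0)
```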
