# Solved – Hierarchical Dirichlet Processes in topic modeling

I think I understand the main ideas of Hierarchical Dirichlet Processes, but I don't understand the specifics of their application in topic modeling. Basically, the idea is that we have the following model:

$$G_{0} \sim DP(\gamma, H)$$
$$G_{j} \sim DP(\alpha_{0}, G_{0})$$
$$\phi_{ji} \sim G_{j}$$
$$x_{ji} \sim F(\phi_{ji})$$

We sample from a Dirichlet process with base distribution $H$ to obtain a discrete distribution $G_{0}$. Then, we use $G_{0}$ as the base distribution of another Dirichlet process to draw $G_{j}$ for every $j$ (in topic modeling, $j$ indexes documents and $G_{j}$ is a distribution over topics for document $j$). After this, for each word in document $j$, we sample from $G_{j}$ to select a particular topic. Some sources say that this draw is the parameter associated with the topic rather than the topic itself. In any case, it acts as a latent variable. Finally, for each document $j$ and word $i$, $x_{ji}$ is drawn from a distribution $F$ that depends on the latent variable $\phi_{ji}$, which is associated in some way with the selected topic.

The question is: how do you describe $F(\phi_{ji})$ explicitly? I think I have seen a multinomial distribution there, but I'm not sure about it. As a comparison, in LDA we need a distribution over words for each topic, and a multinomial distribution is used. What is the equivalent procedure here, and what does it represent in terms of words, documents and topics?


I found this truly excellent review that describes precisely how Hierarchical Dirichlet Processes work.

First, start by choosing a base distribution $H$. In the case of topic modeling, $H$ is taken to be a Dirichlet distribution. Each draw from $H$ should describe a distribution over words for a topic, so its dimension should equal the size of the vocabulary $V$. In the example described in the review, the author assumes a vocabulary of 10 words, so he uses $H = \text{Dirichlet}(1/10, \ldots, 1/10)$. As usual, a realization of this distribution is a 10-dimensional vector $\theta_{k}$ of proportions.
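As a quick illustration (a sketch with NumPy, not the review's own code): we can draw a few such topic vectors from $H$. The vocabulary size $V = 10$ follows the review's example; the number of draws $K$ is an arbitrary choice for display.

```python
import numpy as np

rng = np.random.default_rng(42)

V = 10  # vocabulary size, as in the review's example
K = 5   # number of draws to illustrate (arbitrary choice)

# Each realization of H = Dirichlet(1/10, ..., 1/10) is a distribution
# over the V words, i.e. a candidate topic theta_k
theta = rng.dirichlet(np.full(V, 1.0 / V), size=K)

print(theta.shape)        # (5, 10): K topics, each a length-V probability vector
print(theta.sum(axis=1))  # each row sums to 1
```

Each row of `theta` is one $\theta_{k}$: a point in the $V$-simplex, i.e. a distribution over the vocabulary.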

After this, $H$ is used to build a Dirichlet process $DP(\gamma, H)$, and a realization $G_{0}$ of this process is another discrete distribution with locations $\{\theta_{k}\}$, where each $\theta_{k}$ describes the distribution over words for a topic $k$. If we use $G_{0}$ as the base distribution of another Dirichlet process $DP(\alpha_{0}, G_{0})$, we obtain a realization $G_{j}$ for every document $j$ such that $G_{j}$ has the same support as $G_{0}$. Therefore, every $G_{j}$ shares the same set of $\theta_{k}$'s, although with different proportions (called mixing weights in the definition of a Dirichlet process).
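A minimal sketch of this atom sharing, using a truncated stick-breaking construction for the weights of $G_{0}$, and the standard fact that over a fixed finite set of atoms the document-level weights follow $\text{Dirichlet}(\alpha_{0}\beta)$. The truncation level $K$ and the concentration values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10                  # truncation level (arbitrary)
gamma, alpha0 = 1.0, 1.0

# Global weights beta of G_0 via truncated stick-breaking: beta ~ GEM(gamma)
v = rng.beta(1.0, gamma, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()      # renormalize after truncation

# Document-level mixing weights over the K shared atoms:
# pi_j ~ Dirichlet(alpha0 * beta) for each document j
pi = np.array([rng.dirichlet(alpha0 * beta) for _ in range(3)])

# All three documents place mass on the SAME atoms theta_k,
# just with different mixing weights
print(pi.shape)  # (3, 10)
```

The key point the code mirrors: every document's $G_{j}$ reuses the atoms $\theta_{k}$ of $G_{0}$; only the weights $\pi_{jk}$ differ across documents.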

Finally, for every document $j$ and every word $i$, we draw a realization from $G_{j}$, which selects a particular vector $\theta_{k}$. Since this $\theta_{k}$ is a distribution over words for a given topic, we only need to sample from a multinomial distribution with parameter $\theta_{k}$ to generate words: $w_{ji} \sim \text{Multinomial}(\theta_{k})$.
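Concretely (again only a sketch): once a topic $\theta_{k}$ has been selected for a word slot, generating the word is a single categorical draw, or equivalently a multinomial draw when we want counts for several slots at once.

```python
import numpy as np

rng = np.random.default_rng(1)

V = 10
theta_k = rng.dirichlet(np.full(V, 1.0 / V))  # the selected topic's word distribution

# One word slot: a categorical sample over the vocabulary
w_ji = rng.choice(V, p=theta_k)

# Or, for n word slots sharing the same topic, a multinomial gives word counts
n = 8
counts = rng.multinomial(n, theta_k)

print(int(w_ji))     # an index into the vocabulary, 0..V-1
print(counts.sum())  # 8
```

So $F(\phi_{ji})$ here is exactly this multinomial over the vocabulary, parameterized by the selected topic's word proportions.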

I have seen that sometimes $\phi_{ji}$ is defined as $\phi_{ji} = \theta_{k}$ for every document $j$ and word $i$. Sometimes it is easier to use an indicator variable $z_{ji}$ that is sampled according to the probabilities $\pi_{jk}$ of $G_{j}$ (in $G_{j} = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta_{k}}$) and then used as an index, as in $\theta_{z_{ji}}$. However, I think this is done in the context of the stick-breaking construction.
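In that truncated stick-breaking view, the indicator formulation can be sketched as follows. The weights used here are placeholder Dirichlet draws for illustration, not output of an actual HDP inference procedure.

```python
import numpy as np

rng = np.random.default_rng(7)

K, V = 10, 10
theta = rng.dirichlet(np.full(V, 1.0 / V), size=K)  # shared atoms theta_1..theta_K
pi_j = rng.dirichlet(np.ones(K))                    # placeholder weights for document j

# z_ji picks which atom of G_j this word slot uses; phi_ji is then theta[z_ji]
z_ji = rng.choice(K, p=pi_j)
phi_ji = theta[z_ji]

# The word itself is a categorical draw from the indexed topic
w_ji = rng.choice(V, p=phi_ji)
```

Writing $\phi_{ji} = \theta_{z_{ji}}$ and sampling $z_{ji}$ from $\pi_{j}$ is equivalent to drawing $\phi_{ji}$ directly from $G_{j}$; the indicator form just makes the discrete choice explicit.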
