Solved – Hierarchical Dirichlet Processes in topic modeling

I think I understand the main ideas of hierarchical Dirichlet processes, but I don't understand the specifics of their application in topic modeling. Basically, the idea is that we have the following model:

$$G_{0} \sim \mathrm{DP}(\gamma, H)$$
$$G_{j} \sim \mathrm{DP}(\alpha_{0}, G_{0})$$
$$\phi_{ji} \sim G_{j}$$
$$x_{ji} \sim F(\phi_{ji})$$

We sample from a Dirichlet process with base distribution $H$ to obtain a discrete distribution $G_{0}$. Then, we use $G_{0}$ as the base distribution of another Dirichlet process to obtain a realization $G_{j}$ for every $j$ (in topic modeling, $j$ indexes documents and $G_{j}$ is a distribution over topics for document $j$). After this, for each word in document $j$, we sample from $G_{j}$ to select a particular topic. Some sources say that this draw is the parameter associated with the topic rather than the topic itself; in any case, it acts as a latent variable. Finally, for each document $j$ and word $i$, $x_{ji}$ is drawn from a distribution $F$ that depends on the latent variable $\phi_{ji}$, which is associated in some way with the selected topic.
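
To fix my own notation, here is a rough numpy sketch of how I picture the hierarchy, truncating the stick-breaking representation at $K$ atoms; the base distribution $H$ is just a stand-in (a standard normal) and the emission $F$ at the end is exactly the part I am unsure about:

```python
import numpy as np

rng = np.random.default_rng(0)
K, gamma, alpha0 = 20, 1.0, 1.0       # truncation level and concentrations (arbitrary toy values)

def stick_breaking(concentration, K):
    """Truncated stick-breaking weights of a Dirichlet-process draw, renormalized."""
    b = rng.beta(1.0, concentration, size=K)
    w = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    return w / w.sum()

# G_0 ~ DP(gamma, H): a discrete distribution whose atoms are i.i.d. draws from H.
atoms = rng.normal(size=K)            # theta_k ~ H (placeholder H, not the topic-model choice)
beta = stick_breaking(gamma, K)       # weights of G_0

# G_j ~ DP(alpha_0, G_0): stick-break again, drawing the atoms of G_j from the
# discrete G_0, so G_j reuses the atoms of G_0 with its own weights.
w_j = stick_breaking(alpha0, K)
picks_j = rng.choice(K, p=beta, size=K)   # which atom of G_0 each atom of G_j is

# phi_ji ~ G_j, then x_ji ~ F(phi_ji); F is the emission distribution in question.
phi_ji = atoms[picks_j[rng.choice(K, p=w_j)]]
```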

The question is: how do you describe $F(\phi_{ji})$ explicitly? I think I have seen a multinomial distribution there, but I'm not sure about it. As a comparison, in LDA each topic needs a distribution over words, and a multinomial distribution is used. What is the equivalent procedure here, and what does it represent in terms of words, documents and topics?

I found this truly excellent review that describes precisely how Hierarchical Dirichlet Processes work.

First, start by choosing a base distribution $H$. In the case of topic modeling, $H$ is a Dirichlet distribution, acting as the prior over topic–word distributions. A draw from $H$ should be a distribution over words for a topic, so its dimension should equal the size of the vocabulary $V$. In the example described in the review, the author assumes a vocabulary of 10 words, so he uses $H = \text{Dirichlet}(1/10, \ldots, 1/10)$. As usual, a realization of this distribution is a 10-dimensional vector $\theta_{k}$ of proportions.
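
As a small illustration, a draw from this $H$ can be sketched with numpy (using the review's 10-word vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10                                   # vocabulary size in the review's example

# H = Dirichlet(1/10, ..., 1/10): one draw is one topic's distribution over the V words.
theta_k = rng.dirichlet(np.full(V, 1.0 / V))
print(theta_k, theta_k.sum())            # V nonnegative proportions that sum to 1
```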

After this, $H$ is used to build a Dirichlet process $\mathrm{DP}(\gamma, H)$, and a realization $G_{0}$ of this process is another discrete distribution with locations $\{\theta_{k}\}$, where each $\theta_{k}$ describes the distribution over words for a topic $k$. If we use $G_{0}$ as the base distribution for another Dirichlet process $\mathrm{DP}(\alpha_{0}, G_{0})$, it is possible to obtain a realization $G_{j}$ for every document $j$ in such a way that $G_{j}$ has the same support as $G_{0}$. Therefore, every $G_{j}$ shares the same set of $\theta_{k}$'s, although with different proportions (which are called mixing weights in the definition of a Dirichlet process).

Finally, for every document $j$ and every word $i$, we draw a realization from $G_{j}$, which yields a particular vector $\theta_{k}$. Since this $\theta_{k}$ is a distribution over words for a given topic, we only need to sample from a multinomial distribution with parameter $\theta_{k}$ in order to sample words: $w_{ji} \sim \text{Multinomial}(\theta_{k})$.
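
Putting the steps together, here is a minimal end-to-end sketch with numpy. It truncates the stick-breaking construction at $K$ topics, and the concentration parameters and corpus sizes are arbitrary toy choices, not values taken from the review:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 10, 20                        # vocabulary size (from the review) and truncation level
gamma, alpha0 = 1.0, 1.0             # concentration parameters (arbitrary toy values)
n_docs, n_words = 3, 8               # toy corpus size

def stick_breaking(concentration, K):
    """Truncated stick-breaking weights of a Dirichlet-process draw, renormalized."""
    b = rng.beta(1.0, concentration, size=K)
    w = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    return w / w.sum()

# Atoms of G_0: one topic-word distribution theta_k ~ H per topic.
theta = rng.dirichlet(np.full(V, 1.0 / V), size=K)     # shape (K, V)
beta = stick_breaking(gamma, K)                        # global topic weights of G_0

# Per-document weights: with G_0 discrete and truncated to K atoms,
# G_j ~ DP(alpha_0, G_0) places weights pi_j ~ Dirichlet(alpha_0 * beta)
# on the same K atoms.
pi = rng.dirichlet(alpha0 * beta, size=n_docs)         # shape (n_docs, K)

corpus = []
for j in range(n_docs):
    words = []
    for i in range(n_words):
        topic = theta[rng.choice(K, p=pi[j])]          # draw theta_k from G_j
        words.append(rng.choice(V, p=topic))           # w_ji ~ Multinomial(1, theta_k)
    corpus.append(words)
print(corpus)
```

So, to answer the question directly: $F(\phi_{ji})$ is a multinomial (for a single word, categorical) distribution over the vocabulary whose parameter is the topic–word vector $\theta_{k}$ selected for that word.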

I have seen that sometimes $\phi_{ji}$ is defined as $\phi_{ji} = \theta_{k}$ for every document $j$ and word $i$. Sometimes it is easier to use a variable $z_{ji}$ that works as an index: it is sampled from the probabilities $\pi_{jk}$ of $G_{j}$ (in $G_{j} = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\theta_{k}}$) and then used as in $\theta_{z_{ji}}$. However, I think this is done in the context of the stick-breaking construction.
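
In code, the two formulations differ only in whether the index is kept around explicitly. A small self-contained illustration (the weights of $G_{j}$ here are a placeholder drawn from a symmetric Dirichlet, not from the HDP itself):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 10, 20
theta = rng.dirichlet(np.full(V, 1.0 / V), size=K)   # shared atoms theta_k
pi_j = rng.dirichlet(np.ones(K))                     # placeholder weights standing in for G_j

# Index view: z_ji is sampled from the weights pi_jk and then used to look up the atom.
z_ji = rng.choice(K, p=pi_j)
phi_ji = theta[z_ji]                                 # phi_ji = theta_{z_ji}

# The direct view "phi_ji ~ G_j" returns the same object; z_ji is just
# bookkeeping that records which atom was picked.
```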
