I am taking a machine learning course, and it uses the following two phrases, which confuse me:
each document is a distribution on topics.
and
each topic is a distribution on words.
I was wondering if someone knew what that meant.
Here is a link to the notes:
http://people.csail.mit.edu/moitra/docs/bookex.pdf
Currently, this is how I interpret it (my thoughts).
Well, we are modeling a topic as a vector $u^{(i)}$ containing the relative frequencies of each word, so it just specifies how often each word appears in a specific topic. Also, each document can be thought of, approximately, as a linear combination of these topic vectors, i.e. document $M_{j} = \sum^{r}_{i=1} w_{i}u^{(i)}$,
though I wasn't sure if that was right, or how the concept of a "distribution" fits into this.
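For concreteness, here is a small NumPy sketch of my interpretation (all the numbers are made up):

```python
import numpy as np

# Hypothetical setup: r = 2 topics over a vocabulary of 4 words.
u1 = np.array([0.5, 0.3, 0.1, 0.1])  # relative word frequencies of topic 1
u2 = np.array([0.1, 0.1, 0.4, 0.4])  # relative word frequencies of topic 2

# Document j as a weighted combination M_j = sum_i w_i * u^(i),
# with weights w that sum to 1.
w = np.array([0.7, 0.3])
M_j = w[0] * u1 + w[1] * u2
print(M_j)        # [0.38 0.24 0.19 0.19]
print(M_j.sum())  # 1.0 (up to floating point), so M_j is still a valid distribution over words
```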
Best Answer
Typically, in the context of Latent Dirichlet Allocation (LDA, used for topic modeling), we assume that the documents come from a generative process. I'll avoid heavy math notation; the steps are listed below, followed by a small code sketch.
(1) Every topic is generated from a Dirichlet distribution of $V$ dimensions where $V$ is the size of your vocabulary.
(2) For every document:
- (2.1) Generate a distribution over topics from a Dirichlet distribution of $T$ dimensions where $T$ is the number of topics in the corpus.
- (2.2) For every word in the document:
- (2.2.1) Choose a topic according to the distribution generated at (2.1)
- (2.2.2) Choose a word according to the distribution corresponding to the chosen topic (generated at (1))
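Here is a minimal NumPy sketch of that generative process; the sizes and the Dirichlet parameters $\alpha$, $\beta$ below are made-up values, not anything from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 1000, 10           # vocabulary size, number of topics (hypothetical)
beta = np.full(V, 0.01)   # Dirichlet parameter for topics (hypothetical)
alpha = np.full(T, 0.1)   # Dirichlet parameter for documents (hypothetical)

# (1) every topic is a distribution over the V vocabulary words
topics = rng.dirichlet(beta, size=T)      # shape (T, V), rows sum to 1

def generate_document(n_words):
    # (2.1) a distribution over topics for this document
    theta = rng.dirichlet(alpha)          # shape (T,), sums to 1
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)        # (2.2.1) choose a topic
        words.append(rng.choice(V, p=topics[z]))  # (2.2.2) choose a word
    return words

doc = generate_document(50)  # a "document" of 50 word ids
```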
The rigorous mathematical explanation is here (Section 3).
So, each topic is a probability distribution over the words of the vocabulary (step 1): it gives, for example, the probability that the word "dog" appears under that topic.
And each document has a probability distribution over topics (step 2.1), which says which topics the document is more likely to draw its words from. We say that a document is a mixture of topics.
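Putting the two together: the probability of seeing word $w$ in document $d$ is the mixture $p(w \mid d) = \sum_{t=1}^{T} p(t \mid d)\, p(w \mid t)$, which is exactly the weighted combination of topic vectors written in the question.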
Note:
- A Dirichlet distribution of three dimensions draws things like $[0.2, 0.4, 0.4]$, $[0.3, 0.3, 0.4]$, etc., i.e. nonnegative vectors summing to 1, which can be used as Categorical distributions. This is why it is used both to generate the topics (distributions over $V$ words) and the per-document distributions over $T$ topics.
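A quick demo of that note with NumPy (the parameter $[1, 1, 1]$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
# Each three-dimensional Dirichlet draw is a nonnegative vector summing
# to 1 (up to floating point), so it can be used directly as a
# Categorical distribution.
for sample in rng.dirichlet([1.0, 1.0, 1.0], size=3):
    print(sample.round(2), sample.sum())  # e.g. [0.2 0.4 0.4] 1.0
```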