I am taking a machine learning course, and the notes use the following two phrases that confuse me:

each document is a distribution on topics.

and

each topic is a distribution on words.

I was wondering if someone knew what that meant.

Here is a link to the notes:

http://people.csail.mit.edu/moitra/docs/bookex.pdf

Currently this is how I interpret it (my thoughts).

Well, we are modeling a topic as a vector $u^{(i)}$ holding the relative frequency of each word, so it specifies how often each word appears in that topic. Also, each document can approximately be thought of as a linear combination of these topic vectors, i.e. document $M_{j} = \sum_{i=1}^{r} w_{i} u^{(i)}$

though I wasn't sure if that was right, or how the concept of a "distribution" fits into this.

#### Best Answer

Typically, in the context of Latent Dirichlet Allocation (used for Topic Modeling), we assume that the documents come from a generative process. I'll avoid math notation. Look at this figure:

(1) Every topic is generated from a Dirichlet distribution of $V$ dimensions where $V$ is the size of your vocabulary.

(2) For every document:

- (2.1) Generate a distribution over topics from a Dirichlet distribution of $T$ dimensions, where $T$ is the number of topics in the corpus.
- (2.2) For every word in the document:
  - (2.2.1) Choose a topic according to the distribution generated at (2.1).
  - (2.2.2) Choose a word according to the distribution corresponding to the chosen topic (generated at (1)).
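
The generative process above can be sketched in a few lines of NumPy. This is only an illustration under assumed settings: the tiny vocabulary, the sizes `V`, `T`, `n_docs`, `doc_len`, and the Dirichlet parameters `eta` and `alpha` are all made-up values, not something taken from the course notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup (all values are assumptions for illustration).
vocab = ["dog", "cat", "ball", "goal", "vote", "law"]
V = len(vocab)   # vocabulary size
T = 3            # number of topics
n_docs = 2       # number of documents to generate
doc_len = 8      # words per document

eta = np.full(V, 0.5)    # Dirichlet parameter for topic-word distributions
alpha = np.full(T, 0.5)  # Dirichlet parameter for document-topic distributions

# (1) Every topic is a distribution over the V words: one row per topic.
topics = rng.dirichlet(eta, size=T)          # shape (T, V), each row sums to 1

docs = []
doc_topic = []
for _ in range(n_docs):
    # (2.1) Draw this document's distribution over the T topics.
    theta = rng.dirichlet(alpha)             # shape (T,), sums to 1
    words = []
    for _ in range(doc_len):
        z = rng.choice(T, p=theta)           # (2.2.1) pick a topic
        w = rng.choice(V, p=topics[z])       # (2.2.2) pick a word from that topic
        words.append(vocab[w])
    docs.append(words)
    doc_topic.append(theta)

for theta, words in zip(doc_topic, docs):
    print(np.round(theta, 2), words)
```

Each row of `topics` is "a distribution on words", and each `theta` is "a distribution on topics" for one document.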

The rigorous mathematical explanation is here (section 3).

So, each topic is a probability distribution over the words of the vocabulary (1): for every word, such as "dog", it gives the probability that the word appears when writing about that topic.

And each document has a probability distribution over topics (2.1), which says from which topics the document is more likely to draw its words. We say that a document is a *mixture of topics*.
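
To connect this with the linear combination in the question, here is the same statement as one formula. The symbols are my own notation, not the notes': $\theta_{d,t}$ is the probability of topic $t$ in document $d$ (from step 2.1), and $\beta_{t,w}$ is the probability of word $w$ in topic $t$ (from step 1). The probability that a given word position in document $d$ holds word $w$ is then

$$p(w \mid d) = \sum_{t=1}^{T} \theta_{d,t}\, \beta_{t,w}, \qquad \theta_{d,t} \ge 0, \quad \sum_{t=1}^{T} \theta_{d,t} = 1 .$$

So the expected word-frequency vector of a document is a linear combination of the topic vectors, as in the question, with the extra constraint that the weights are non-negative and sum to 1. That constraint is what makes the weights a "distribution on topics".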

**Note:**

- A Dirichlet distribution of three dimensions draws things like [0.2,0.4,0.4], [0.3,0.3,0.4], etc.: vectors of non-negative numbers that sum to 1, which can therefore be used as the parameters of Categorical distributions. This is why it is used to generate distributions over $V$ words (topics) and distributions over $T$ topics. See the left and right sides of the figure.
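
As a small check of the note above, the following sketch draws a few vectors from a three-dimensional Dirichlet and then uses one of them as a Categorical distribution. The parameter `[1.0, 1.0, 1.0]` and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Four draws from a 3-dimensional Dirichlet (parameter values are assumed).
samples = rng.dirichlet([1.0, 1.0, 1.0], size=4)   # shape (4, 3)
print(np.round(samples, 2))
print(samples.sum(axis=1))   # every row sums to 1 (up to floating point)

# Use the first draw as a Categorical distribution over 3 outcomes.
draws = rng.choice(3, size=10, p=samples[0])
print(draws)
```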
