I am wondering whether one needs to normalize or weight a topic model by document length (page length).

I am estimating a topic model on social science (JSTOR) articles, which vary in length from 5 to 200 pages. I want to analyse a specific topic, namely the degree to which economic topics appear in social science articles.

I can see that a similar question was raised back in 2011, but as far as I can interpret the discussion, no clear answer was reached:

https://lists.cs.princeton.edu/pipermail/topic-models/2011-February/001171.html

My intuition about this question is somewhat split.

On the one hand, it seems logical that one needs to weight by document length, since a longer document (a 200-page article) has more pages in which to refer to a specific topic (in my case, "economic") than a shorter document (a 5-page article). This will be reflected, for example, in a document-term matrix, where economic terms (e.g. markets, business, and industry) will have much higher frequencies in the row for the 200-page document than in the row for the 5-page document. Moreover, the 200-page document will affect the overall term distribution: its terms will dominate the per-term totals of the document-term matrix.
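To make this concrete, here is a toy sketch (the counts are invented, not taken from the JSTOR corpus) of how raw counts in a document-term matrix let a long document dominate the column totals, and how dividing each row by its total restores comparability:

```python
import numpy as np

# Invented toy document-term matrix: rows = documents, columns = counts
# for "markets", "business", "industry", and all remaining terms.
# Row 0 plays the 200-page article, row 1 the 5-page article.
dtm = np.array([
    [100, 80, 60, 2760],  # long document: 3000 tokens in total
    [  4,  3,  3,   20],  # short document: 30 tokens in total
], dtype=float)

# Column totals are dominated by the long document's raw counts.
print(dtm.sum(axis=0))            # [ 104.   83.   63. 2780.]

# Dividing each row by its total turns counts into within-document
# proportions, which are comparable across document lengths.
proportions = dtm / dtm.sum(axis=1, keepdims=True)
print(proportions[1, 0])          # 4/30 for the short document
```

Each row of `proportions` sums to 1, so a 200-page article no longer outweighs a 5-page one simply by having more tokens.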

On the other hand, the topic-term ratio seems to adjust for the fact that the sample contains both longer and shorter documents. Even if raw term frequencies are high for longer documents and low for shorter ones, the relative frequencies (the proportions of the various terms) are comparable between them. For example, the shorter document might have 10 economic-topic tokens out of 30 tokens in total, giving an economic-topic proportion of 10/30, whereas the longer document might have 100 economic-topic tokens out of 3000 tokens (across all topics): a ratio of 100/3000.

Accordingly, even though the shorter document has fewer economic-topic tokens than the longer one, it is still estimated to be more economic.
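The arithmetic behind this can be checked directly (a trivial sketch, using only the token counts stated in the example):

```python
# Token counts from the example: economic tokens / total tokens.
short_economic, short_total = 10, 30     # 5-page document
long_economic, long_total = 100, 3000    # 200-page document

short_share = short_economic / short_total  # 10/30
long_share = long_economic / long_total     # 100/3000

# Despite having fewer economic tokens in absolute terms, the short
# document has a ten times larger economic share.
print(short_share > long_share)  # True
print(short_share / long_share)  # 10.0
```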

I am not sure what to conclude from this: can I trust page-unadjusted LDA results? I am using the topicmodels package in R.

Many thanks in advance for your input.


#### Best Answer

I haven't used topic models much, but I can say that if you apply the usual clustering methods to un-normalized document-term matrices (even when the dimensionality of the data is reduced with LSA), you'll see that longer articles tend to cluster together, simply because they have more words.

So take a look at some of your topics and see whether the documents inside them make sense. Also, try calculating the average document length per topic to see whether the phenomenon I mention takes place.
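One way to run that diagnostic (purely a sketch with invented numbers; the questioner uses R's topicmodels, where the document-topic proportions would typically come from posterior(), but the idea is language-agnostic) is to assign each document to its strongest topic and average the page counts per topic:

```python
import numpy as np

# Invented document-topic proportions (rows = documents, columns =
# topics) and invented page counts for four documents.
doc_topic = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.1, 0.9],
])
doc_lengths = np.array([200, 5, 180, 8])  # pages per document

dominant = doc_topic.argmax(axis=1)       # strongest topic per document
for k in range(doc_topic.shape[1]):
    mean_len = doc_lengths[dominant == k].mean()
    print(f"topic {k}: mean length {mean_len:.1f} pages")
```

If one topic's mean length is far above the others, that topic may be absorbing long documents rather than capturing a coherent theme.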

Then repeat the same on the unit-normalized data and see whether the results make more sense.
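A minimal sketch of that normalization step, assuming "unit-normalized" means scaling each document-term row to unit L2 norm (one common reading; the counts below are invented):

```python
import numpy as np

# Invented counts; scale each document row to unit L2 norm so that
# document length no longer drives distances between row vectors.
dtm = np.array([[100.0, 80.0, 60.0],
                [  4.0,  3.0,  3.0]])
norms = np.linalg.norm(dtm, axis=1, keepdims=True)
unit_dtm = dtm / norms

print(np.linalg.norm(unit_dtm, axis=1))  # every row now has norm 1.0
```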

### Similar Posts:

- Solved – Using topic words generated by LDA to represent a document
- Solved – LDA: find percentage / number of documents per topic
- Solved – Given an LDA model, how can I calculate p(word|topic,party), where each document belongs to a party
- Solved – Document Similarity Gensim
- Solved – Clustering of documents that are very different in number of words