I am trying to compute pointwise mutual information (PMI) using Wikipedia as the data source. Given two words, PMI measures how strongly they are associated. The formula is below.

`pmi(word1, word2) = log [ probability(word1, word2) / (probability(word1) * probability(word2)) ]`

Here `probability(word1, word2)` is the probability that both words appear together in a document.

Hence, to compute PMI, I need the joint and individual probabilities of word1 and word2. I looked at the Wikipedia Miner relatedness score between two words, which implements the Milne and Witten algorithm. However, for defining topic similarities, PMI is a better score.

Does anyone know how to compute the PMI score for two words using DBpedia, Wikipedia Miner, or any other software?

Ramki

#### Best Answer

You can compute PMI using Wikipedia as follows:

1) Use Lucene to index a Wikipedia dump.

2) Use the Lucene API to get the following counts, which is straightforward:

The number (N1) of documents containing word1 and the number (N2) of documents containing word2. Then Prob(word1) = (N1 + 1) / N and Prob(word2) = (N2 + 1) / N, where N is the total number of documents in Wikipedia and the "+ 1" in the formulas avoids zero counts.

The number (N3) of documents in which both words appear together. You can also apply a stricter constraint, so that the two words must appear within a 10-word (or 20-word) context window. Similarly, Prob(word1, word2) = (N3 + 1) / N.

Then: PMI(word1, word2) = log(Prob(word1, word2) / (Prob(word1) * Prob(word2)))
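The counting and the formula above can be sketched in Python. This is only an illustration under stated assumptions: a small in-memory list of strings stands in for the Lucene index of Wikipedia, the whitespace tokenizer is deliberately naive, and the names `tokenize`, `cooccur`, `pmi_from_counts` and `pmi` are hypothetical helpers, not part of Lucene or Palmetto.

```python
import math
from typing import Iterable, List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cooccur(tokens: List[str], w1: str, w2: str,
            window: Optional[int] = None) -> bool:
    """True if both words occur in the document.

    If `window` is given, some occurrence of w1 and some occurrence of w2
    must also fit inside a span of `window` consecutive tokens.
    """
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    if not pos1 or not pos2:
        return False
    if window is None:
        return True
    return any(abs(i - j) < window for i in pos1 for j in pos2)


def pmi_from_counts(n1: int, n2: int, n3: int, n: int) -> float:
    """PMI from document counts, with the +1 adjustment described above."""
    p1 = (n1 + 1) / n
    p2 = (n2 + 1) / n
    p12 = (n3 + 1) / n
    return math.log(p12 / (p1 * p2))


def pmi(docs: Iterable[str], w1: str, w2: str,
        window: Optional[int] = None) -> float:
    """Count N, N1, N2, N3 over `docs` and return the PMI of w1 and w2."""
    n = n1 = n2 = n3 = 0
    for doc in docs:
        tokens = tokenize(doc)
        n += 1
        n1 += int(w1 in tokens)
        n2 += int(w2 in tokens)
        n3 += int(cooccur(tokens, w1, w2, window))
    return pmi_from_counts(n1, n2, n3, n)


# Toy corpus standing in for Wikipedia articles (an assumption for the demo).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats",
    "a cat and a dog",
]
print(pmi(docs, "cat", "dog"))            # document-level co-occurrence
print(pmi(docs, "cat", "dog", window=2))  # words must be adjacent
```

With a real index, the three counts would come from the index's document frequencies and from a query matching both terms, and only `pmi_from_counts` would be needed.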

Furthermore, I suggest having a look at the WSDM 2015 paper "Exploring the Space of Topic Coherence Measures" and its associated toolkit, Palmetto (https://github.com/AKSW/Palmetto), which implements PMI and other topic coherence scores.
