Solved – Topic similarity: semantic PMI between two words using Wikipedia

I am trying to compute pointwise mutual information (PMI) using Wikipedia as the data source. Given two words, PMI measures the strength of association between them. The formula is below.

pmi(word1, word2) = log [ P(word1, word2) / (P(word1) * P(word2)) ], where P(word1, word2) is the probability that both words appear together in a document.
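For instance, with purely hypothetical counts (1,000 documents containing word1, 2,000 containing word2, 500 containing both, out of 1,000,000 documents in total), this gives:

pmi = log [ (500/1000000) / ((1000/1000000) * (2000/1000000)) ] = log(250) ≈ 5.52 (natural log).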

Hence, to compute PMI I would need the joint and the individual probabilities of word1 and word2. I looked at the Wikipedia Miner relatedness score between two words; it implements the Milne and Witten algorithm. However, for measuring topic similarity, PMI is a better score for my purposes.

Does anyone know how to compute a PMI score for two words using DBpedia, Wikipedia Miner, or any other software?

Ramki

Best Answer

You can compute PMI using Wikipedia as follows:

1) Use Lucene to index a Wikipedia dump.
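As a minimal sketch of this step (assuming Lucene 8.x on the classpath, that the dump has already been extracted to one plain-text string per article, e.g. with WikiExtractor, and a field name "body" chosen purely for illustration), the indexing could look like this:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.List;

public class WikiIndexer {
    // articleTexts: the plain text of each Wikipedia article (extraction not shown)
    public static void buildIndex(String indexDir, List<String> articleTexts) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get(indexDir));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (String text : articleTexts) {
                Document doc = new Document();
                // index the article body as a tokenized, unstored field
                doc.add(new TextField("body", text, Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}
```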

2) Using the Lucene API, it is straightforward to get:

  • The number (N1) of documents containing word1 and the number (N2) of documents containing word2. Then Prob(word1) = (N1 + 1) / N and Prob(word2) = (N2 + 1) / N, where N is the total number of documents in Wikipedia and the "+1" in the formulas is add-one smoothing to avoid zero counts.

  • The number (N3) of documents in which both words appear together. You can also impose a stronger constraint, requiring the two words to appear within a 10-word (or 20-word) window. Similarly, Prob(word1, word2) = (N3 + 1) / N. (A counting sketch follows this list.)
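Here is a minimal sketch of these counts. It assumes the index built above (illustrative field name "body"), example words "cat" and "dog", and Lucene 8.x; the within-window constraint would need something like a SpanNearQuery instead of the plain BooleanQuery and is not shown.

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class WikiCounts {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("wiki-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            int n  = reader.numDocs();                          // N: total number of documents
            int n1 = reader.docFreq(new Term("body", "cat"));   // N1: documents containing word1
            int n2 = reader.docFreq(new Term("body", "dog"));   // N2: documents containing word2

            // N3: documents containing both words (document-level co-occurrence)
            BooleanQuery both = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("body", "cat")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("body", "dog")), BooleanClause.Occur.MUST)
                    .build();
            int n3 = searcher.count(both);

            System.out.printf("N=%d N1=%d N2=%d N3=%d%n", n, n1, n2, n3);
        }
    }
}
```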

We then have: PMI(word1, word2) = log(Prob(word1, word2) / (Prob(word1) * Prob(word2)))
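For completeness, a small helper that plugs the smoothed counts into this formula might look as follows (natural logarithm; divide by Math.log(2) if you prefer bits):

```java
public class Pmi {
    // n1, n2: document frequencies of word1 and word2; n3: joint count; n: total documents
    public static double pmi(long n1, long n2, long n3, long n) {
        double pWord1 = (n1 + 1.0) / n;   // Prob(word1) with add-one smoothing
        double pWord2 = (n2 + 1.0) / n;   // Prob(word2)
        double pJoint = (n3 + 1.0) / n;   // Prob(word1, word2)
        return Math.log(pJoint / (pWord1 * pWord2));
    }
}
```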

Furthermore, I would suggest having a look at the WSDM 2015 paper "Exploring the Space of Topic Coherence Measures" and its associated toolkit Palmetto (https://github.com/AKSW/Palmetto), which implements PMI alongside a number of other topic coherence measures.
