Solved – Why can we use entropy to measure the quality of a language model

I am reading the book Foundations of Statistical Natural Language Processing. It makes the following statement about the relationship between information entropy and language models:

…The essential point here is that if a model captures more of the
structure of a language, then the entropy of the model should be
lower. In other words, we can use entropy as a measure of the quality
of our models…

But how about this example:

Suppose we have a machine that emits two characters, A and B, one at a time, and its designer has given A and B equal probability.

I am not the designer, so I try to model the machine by experiment.

During an initial experiment, I see the machine emit the following character sequence:

A, B, A

So I model the machine as $P(A)=\frac{2}{3}$ and $P(B)=\frac{1}{3}$. The entropy of this model is:

$$-\frac{2}{3}\cdot\log{\frac{2}{3}}-\frac{1}{3}\cdot\log{\frac{1}{3}} = 0.918\quad\text{bit}$$

(The base of the logarithm is $2$, so the unit is the bit.)

But then the designer tells me about his design, so I refine my model with this additional information. The new model looks like this:

$P(A)=\frac{1}{2}$, $P(B)=\frac{1}{2}$

And the entropy of this new model is:

$$-\frac{1}{2}\cdot\log{\frac{1}{2}}-\frac{1}{2}\cdot\log{\frac{1}{2}} = 1\quad\text{bit}$$

The second model is obviously better than the first one. But the entropy increased.
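
To double-check the two numbers, here is a minimal Python sketch. The function name `entropy_bits` and the variable names are my own choices for illustration; nothing here comes from the book.

```python
import math

def entropy_bits(dist):
    """Shannon entropy, in bits, of a distribution given as {symbol: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical names for the two models described above.
estimated_model = {"A": 2 / 3, "B": 1 / 3}  # fitted from the sample A, B, A
designed_model = {"A": 1 / 2, "B": 1 / 2}   # what the designer described

print(round(entropy_bits(estimated_model), 3))  # 0.918
print(round(entropy_bits(designed_model), 3))   # 1.0
```

So the arithmetic is right: the model that matches the design really does have the higher entropy.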

My point is, due to the arbitrariness of the model being tried, we cannot blindly say a smaller entropy indicates a better model.

Could anyone shed some light on this?


After re-reading the NLP book mentioned above, I think I can explain it now.

What I calculated is actually the entropy of the model's own distribution. That quantity cannot be used to evaluate how good a language model is.

To evaluate a language model, we should measure how much surprise it gives us on real sequences in that language. For each real word encountered, the model assigns a probability $p$, and we use $-\log(p)$ to quantify the surprise. We then average the surprise over a long enough sequence. So, for a 1000-letter sequence with 500 A's and 500 B's, the average surprise under the 1/3-2/3 model is:

$$\frac{-500\log{\frac{1}{3}} - 500\log{\frac{2}{3}}}{1000} = \frac{1}{2}\log{\frac{9}{2}} \approx 1.085\quad\text{bit}$$

While the correct 1/2-1/2 model gives:

$$\frac{-500\log{\frac{1}{2}} - 500\log{\frac{1}{2}}}{1000} = \frac{1}{2}\log{\frac{8}{2}} = 1\quad\text{bit}$$

So we can see that the 1/3-2/3 model gives more surprise, which indicates it is worse than the correct model.
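
The same comparison can be written as a few lines of Python. This is only a sketch: the helper `average_surprise_bits` and the variable names are made up for illustration, and base-2 logarithms are assumed so that the unit is the bit.

```python
import math

def average_surprise_bits(model, sequence):
    """Mean of -log2(model[symbol]) over every symbol in the sequence."""
    return sum(-math.log2(model[s]) for s in sequence) / len(sequence)

sequence = "A" * 500 + "B" * 500          # 1000 letters, half A and half B
skewed_model = {"A": 2 / 3, "B": 1 / 3}   # hypothetical name for the 1/3-2/3 model
fair_model = {"A": 1 / 2, "B": 1 / 2}     # hypothetical name for the 1/2-1/2 model

print(round(average_surprise_bits(skewed_model, sequence), 3))  # 1.085
print(round(average_surprise_bits(fair_model, sequence), 3))    # 1.0
```

Note that the score depends on both the model and the real sequence, whereas the entropy of a model depends on the model alone.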

Only when the sequence is long enough does the average mimic the expectation over the true 1/2-1/2 distribution. If the sequence is short, the result is not convincing.

I didn't mention cross-entropy here because I think the jargon is intimidating and does little to reveal the root cause.
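
The effect of sequence length can be seen in a small simulation. This is only a sketch: the helper `average_surprise_bits`, the seed `0`, and the length of 100,000 are my own illustrative choices, and the long sequence is drawn with Python's `random` module rather than from any real machine.

```python
import math
import random

def average_surprise_bits(model, sequence):
    """Mean of -log2(model[symbol]) over every symbol in the sequence."""
    return sum(-math.log2(model[s]) for s in sequence) / len(sequence)

skewed_model = {"A": 2 / 3, "B": 1 / 3}
fair_model = {"A": 1 / 2, "B": 1 / 2}

short_sequence = ["A", "B", "A"]          # the initial experiment
rng = random.Random(0)                    # fixed seed (arbitrary) so runs are repeatable
long_sequence = [rng.choice("AB") for _ in range(100_000)]  # simulated fair machine

for name, seq in (("short", short_sequence), ("long", long_sequence)):
    skewed = average_surprise_bits(skewed_model, seq)
    fair = average_surprise_bits(fair_model, seq)
    print(f"{name}: skewed={skewed:.3f} bit, fair={fair:.3f} bit")
```

On the three-letter sample A, B, A the skewed model actually scores lower (about 0.918 bit versus 1 bit), which is exactly the misleading short-sequence case. On the long sample its average should land near 1.085 bits, above the fair model's 1 bit.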


I'd like to share a bit more of my understanding of the surprise mentioned above.

The word surprise here is essentially a synonym for information gained. But it is very abstract and subjective to say how much information we gained through some experience, so we need a concrete and objective measure. That measure is $-\log(p)$.

Among so many mathematical options, why choose this particular function? I read about psychophysics, which describes a logarithmic rule relating physical stimuli to the contents of consciousness, and I think this explains the choice of $\log(p)$. As for the minus sign, I think it is because humans tend to use positive numbers for measures such as length and area, and since a probability lies between 0 and 1, the raw $\log(p)$ is never positive. This design is also consistent with the common-sense idea that less likely things carry more information, or surprise. So that's it.

Put simply, we use mathematics to describe and model the world, and mathematics ultimately just reflects our instincts.
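
To make the "less likely means more surprise" point concrete, here is a tiny Python sketch. The function name `surprise_bits` is my own, and base 2 is assumed so the values come out in bits.

```python
import math

def surprise_bits(p):
    """Surprise -log2(p), written as log2(1/p) so that p = 1 prints as 0.0."""
    return math.log2(1 / p)

for p in (1.0, 0.5, 0.25, 0.125):
    print(p, surprise_bits(p))
# 1.0 0.0
# 0.5 1.0
# 0.25 2.0
# 0.125 3.0
```

A certain event gives no surprise at all, and each halving of the probability adds one more bit.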
