# Solved – Why can we use entropy to measure the quality of a language model

I am reading *Foundations of Statistical Natural Language Processing*. It makes the following statement about the relationship between information entropy and language models:

…The essential point here is that if a model captures more of the
structure of a language, then the entropy of the model should be
lower. In other words, we can use entropy as a measure of the quality
of our models…

Suppose we have a machine that spits out \$2\$ characters, A and B, one at a time, and the designer of the machine has given A and B equal probability.

I am not the designer, so I try to model the machine through experiment.

In an initial experiment, I see the machine spit out the following character sequence:

A, B, A

So I model the machine as \$P(A)=\frac{2}{3}\$ and \$P(B)=\frac{1}{3}\$, and we can calculate the entropy of this model as:
\$\$
H = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} \approx 0.918
\$\$
(the base is \$2\$, so the unit is the bit)

But then the designer tells me about his design, so I refine my model with this additional information. The new model looks like this:

\$P(A)=\frac{1}{2}\$, \$P(B)=\frac{1}{2}\$

And the entropy of this new model is:
\$\$
H = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1
\$\$
The second model is obviously better than the first one, but the entropy increased.
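The two entropy values in this question can be checked numerically. Below is a minimal sketch, assuming base-2 logarithms as stated above; the helper name `entropy_bits` and the dictionary representation of a model are illustrative choices, not anything taken from the book.

```python
import math


def entropy_bits(dist):
    """Shannon entropy, in bits, of a distribution given as {symbol: probability}."""
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical names for the two models described in the question.
first_model = {"A": 2 / 3, "B": 1 / 3}   # fitted to the observed sequence A, B, A
second_model = {"A": 1 / 2, "B": 1 / 2}  # the designer's stated probabilities

print(round(entropy_bits(first_model), 3))   # 0.918
print(round(entropy_bits(second_model), 3))  # 1.0
```

The model fitted to three samples does have the lower entropy, which is exactly the apparent paradox being asked about.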

My point is, due to the arbitrariness of the model being tried, we cannot blindly say a smaller entropy indicates a better model.

Could anyone shed some light on this?


After re-reading the NLP book mentioned above, I think I can explain it now.

What I calculated is actually the entropy of the language model's own distribution. It cannot be used to evaluate the effectiveness of a language model.

To evaluate a language model, we should measure how much surprise it gives us on real sequences in that language. For each real word encountered, the language model gives a probability p, and we use -log(p) to quantify the surprise. We then average the surprise over a long enough sequence. So, for a 1000-letter sequence with 500 A's and 500 B's, the average surprise given by the 1/3-2/3 model will be:

[-500 log(1/3) - 500 log(2/3)] / 1000 = 1/2 * log(9/2)

While the correct 1/2-1/2 model will give:

[-500 log(1/2) - 500 log(1/2)] / 1000 = 1/2 * log(8/2)

So we can see that the 1/3-2/3 model gives more surprise, which indicates that it is worse than the correct model.
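The comparison above can be reproduced in a few lines. This is only a sketch: it assumes base-2 logarithms (so the results are in bits), and the function name `average_surprise_bits` and the model dictionaries are hypothetical names introduced for illustration.

```python
import math


def average_surprise_bits(model, sequence):
    """Mean of -log2 p(symbol) over the sequence, with p taken from the model."""
    return sum(-math.log2(model[symbol]) for symbol in sequence) / len(sequence)


# The 1000-letter sequence from the text: 500 A's and 500 B's.
sequence = ["A"] * 500 + ["B"] * 500

skewed_model = {"A": 2 / 3, "B": 1 / 3}   # the model fitted to A, B, A
uniform_model = {"A": 1 / 2, "B": 1 / 2}  # the correct model

print(round(average_surprise_bits(skewed_model, sequence), 3))   # 1.085
print(round(average_surprise_bits(uniform_model, sequence), 3))  # 1.0
```

With base 2, 1/2 * log(9/2) is about 1.085 bits and 1/2 * log(8/2) is exactly 1 bit, so the skewed model is the more surprised of the two.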

Only when the sequence is long enough will the average mimic the expectation over the true 1/2-1/2 distribution. If the sequence is short, it won't give a convincing result.
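To see how a short sequence can mislead, here is a small sketch that scores both models on just the three letters A, B, A from the original experiment. As before, base-2 logarithms are assumed and all names are illustrative.

```python
import math


def average_surprise_bits(model, sequence):
    """Mean of -log2 p(symbol) over the sequence, with p taken from the model."""
    return sum(-math.log2(model[symbol]) for symbol in sequence) / len(sequence)


# Only three observations: far too few to reflect the true 1/2-1/2 source.
short_sequence = ["A", "B", "A"]

skewed_model = {"A": 2 / 3, "B": 1 / 3}
uniform_model = {"A": 1 / 2, "B": 1 / 2}

print(round(average_surprise_bits(skewed_model, short_sequence), 3))   # 0.918
print(round(average_surprise_bits(uniform_model, short_sequence), 3))  # 1.0
```

On this tiny sample the skewed model looks better, simply because it was fitted to those same three letters. The ranking only flips to the correct one once the evaluation sequence is long enough to resemble the true distribution.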

I didn't mention cross-entropy here because I think the jargon is too intimidating and does not help much in revealing the root cause.

The word surprise here is actually a synonym for information gain. But it is very abstract and subjective to say how much information we gained through some experience, so we need a concrete and objective measure of it. This measure is $$-\log(p)$$.
Well, among so many mathematical options, why do we choose this particular function? I read about psychophysics, which mentions a logarithmic rule that relates physical stimuli to the contents of consciousness, so I think this explains why we choose $$\log(p)$$. As to why we add `-` to it, I think it's because humans tend to use positive numbers as measures, such as length and area, and a probability is a value in [0, 1], which makes the raw $$\log(p)$$ function negative or zero. This design is also consistent with our common sense that less likely things give more information/surprise. So that's it.