I am reading *Foundations of Statistical Natural Language Processing*. It has the following statement about the relationship between information entropy and language models:

…The essential point here is that if a model captures more of the structure of a language, then the entropy of the model should be lower. In other words, we can use entropy as a measure of the quality of our models…

But how about this example:

Suppose we have a machine that spits out two characters, A and B, one by one. And the designer of the machine made A and B equally probable.

I am not the designer, and I try to model it through experiment.

During an initial experiment, I see the machine emit the following character sequence:

A, B, A

So I model the machine as $P(A)=\frac{2}{3}$ and $P(B)=\frac{1}{3}$, and we can calculate the entropy of this model as:

$$
-\frac{2}{3}\log{\frac{2}{3}}-\frac{1}{3}\log{\frac{1}{3}} \approx 0.918\quad\text{bits}
$$

(the logarithm base is $2$, so the unit is bits)

But then the designer told me about his design, so I refined my model with this additional information. The new model looks like this:

$P(A)=\frac{1}{2}$, $P(B)=\frac{1}{2}$

And the entropy of this new model is:

$$
-\frac{1}{2}\log{\frac{1}{2}}-\frac{1}{2}\log{\frac{1}{2}} = 1\quad\text{bit}
$$

The second model is obviously better than the first one. But the entropy increased.
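
Just to make the numbers concrete, here is a quick sketch in Python (standard library only; this is merely my own check, not something from the book):

```python
from math import log2

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {symbol: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

estimated = {"A": 2/3, "B": 1/3}  # model fitted to the short sample A, B, A
designed  = {"A": 1/2, "B": 1/2}  # the designer's actual distribution

print(entropy(estimated))  # ~0.918 bits
print(entropy(designed))   # 1.0 bits -- higher, although this model is better
```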

My point is that, due to the arbitrariness of the model being tried, we cannot blindly say that a smaller entropy indicates a better model.

Could anyone shed some light on this?


#### Best Answer

(For more info, please check here: https://stackoverflow.com/questions/22933412/why-can-we-use-entropy-to-measure-the-quality-of-language-model)

After re-digesting the mentioned NLP book, I think I can explain it now.

What I calculated above is actually the entropy of the language model's own distribution. It cannot be used to evaluate the effectiveness of a language model.

To evaluate a language model, we should measure how much surprise it gives us for real sequences in that language. For each real word encountered, the language model gives a probability $p$, and we use $-\log(p)$ to quantify the surprise. We then average the total surprise over a long enough sequence. So, in the case of a 1000-letter sequence with 500 A's and 500 B's, the surprise given by the 1/3-2/3 model will be:

$$
\frac{-500\log{\frac{1}{3}} - 500\log{\frac{2}{3}}}{1000} = \frac{1}{2}\log{\frac{9}{2}} \approx 1.085\quad\text{bits}
$$

While the correct 1/2-1/2 model will give:

$$
\frac{-500\log{\frac{1}{2}} - 500\log{\frac{1}{2}}}{1000} = \frac{1}{2}\log{\frac{8}{2}} = 1\quad\text{bit}
$$

So we can see that the 1/3-2/3 model gives more surprise, which indicates it is worse than the correct model.
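
To make this concrete, here is a small Python check of both averages (again just my own illustration; the 500/500 counts are the ones assumed above):

```python
from math import log2

counts = {"A": 500, "B": 500}  # a 1000-letter sequence with 500 A's and 500 B's
total = sum(counts.values())

def average_surprise(model):
    """Average -log2(p) per symbol that `model` assigns to the observed sequence."""
    return sum(-counts[s] * log2(model[s]) for s in counts) / total

print(average_surprise({"A": 2/3, "B": 1/3}))  # ~1.085 bits, i.e. 1/2 * log2(9/2)
print(average_surprise({"A": 1/2, "B": 1/2}))  # 1.0 bits,    i.e. 1/2 * log2(8/2)
```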

Only when the sequence is long enough will the average surprise mimic the expectation over the 1/2-1/2 distribution. If the sequence is short, it won't give a convincing result.
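
Here is a quick simulation of that point: sample ever longer sequences from the true fair machine and look at the average surprise under the wrong 1/3-2/3 model (the lengths and seed are arbitrary choices of mine):

```python
import random
from math import log2

wrong_model = {"A": 2/3, "B": 1/3}

random.seed(0)
for n in (10, 100, 1000, 100_000):
    seq = random.choices("AB", k=n)  # the true machine: A and B equally likely
    avg = sum(-log2(wrong_model[c]) for c in seq) / n
    print(n, round(avg, 3))          # drifts toward ~1.085 bits as n grows
```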

I didn't mention cross-entropy here, since I think that jargon is too intimidating and not very helpful for revealing the root cause.

## ADD 1

I'd like to share a bit more about *my understanding* of the **surprise** mentioned in the last paragraph.

The word **surprise** here is actually a synonym for **information gain**. But it is very **abstract and subjective** to say how much information we gained through some experience, so we need a **concrete and objective** measure of it. This measure is $-\log(p)$.

Well, among so many mathematical options, why do we choose this particular function? I read about psychophysics, which mentions the **logarithmic rule** that *relates physical stimuli to the contents of consciousness*. So I think this explains why we choose $\log(p)$. As to why we add a `-` to it, I think it's because humans tend to use positive numbers as measures, such as length and area, while probability is a value in $[0, 1]$, which makes the raw $\log(p)$ negative. Also, this design is consistent with our common sense that **less probable things give more info/surprise**. So that's it.
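
A tiny numeric illustration of that last point (my own example values, base-2 logarithm):

```python
from math import log2

for p in (0.9, 0.5, 0.1, 0.01):
    print(p, -log2(p))  # ~0.152, 1.0, ~3.322, ~6.644 bits: rarer events, more surprise
```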

Put simply, we use **mathematics** to describe/model the world. And mathematics ultimately just reflects our **instinct**.