I am reading *Foundations of Statistical Natural Language Processing*. It has the following statement about the relationship between information entropy and language models:
> …The essential point here is that if a model captures more of the structure of a language, then the entropy of the model should be lower. In other words, we can use entropy as a measure of the quality of our models…
But how about this example:
Suppose we have a machine that spits out two characters, A and B, one by one, and the designer of the machine makes A and B equally probable.
I am not the designer, and I try to model the machine through experiments.
During an initial experiment, I see the machine spit out the following character sequence:
A, B, A
So I model the machine as $P(A)=\frac{2}{3}$ and $P(B)=\frac{1}{3}$, and we can calculate the entropy of this model as:
$$
-\frac{2}{3}\cdot\log{\frac{2}{3}}-\frac{1}{3}\cdot\log{\frac{1}{3}} \approx 0.918\quad\text{bits}
$$
(the base is $2$, so the unit is bits)
But then the designer tells me about his design, so I refine my model with this additional information. The new model looks like this:
$P(A)=\frac{1}{2}$, $P(B)=\frac{1}{2}$
And the entropy of this new model is:
$$
-\frac{1}{2}\cdot\log{\frac{1}{2}}-\frac{1}{2}\cdot\log{\frac{1}{2}} = 1\quad\text{bit}
$$
The second model is obviously better than the first one. But the entropy increased.
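Here is a minimal Python sketch (my own, just to double-check the two numbers above) of this entropy calculation:

```python
# Minimal sketch: Shannon entropy of each model's distribution, in bits.
from math import log2

def entropy(dist):
    """H(P) = -sum_x P(x) * log2(P(x)) for a distribution given as {symbol: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

model_1 = {"A": 2/3, "B": 1/3}   # estimated from the short sequence A, B, A
model_2 = {"A": 1/2, "B": 1/2}   # the designer's true distribution

print(entropy(model_1))  # ~0.918 bits
print(entropy(model_2))  # 1.0 bits
```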
My point is that, because the model we try is arbitrary, we cannot blindly say that a smaller entropy indicates a better model.
Could anyone shed some light on this?
Best Answer
(For more info, please check here: https://stackoverflow.com/questions/22933412/why-can-we-use-entropy-to-measure-the-quality-of-language-model)
After re-digesting the mentioned NLP book, I think I can explain it now.
What I calculated is actually the entropy of the language model distribution. It cannot be used to evaluate the effectiveness of a language model.
To evaluate a language model, we should measure how much surprise it gives us for real sequences in that language. For each real word encountered, the language model gives a probability $p$, and we use $-\log(p)$ to quantify the surprise. Then we average the total surprise over a long enough sequence. So, in the case of a 1000-letter sequence with 500 A's and 500 B's, the surprise given by the 1/3-2/3 model will be:
$$
\frac{-500\log{\frac{2}{3}}-500\log{\frac{1}{3}}}{1000} = \frac{1}{2}\log{\frac{9}{2}}
$$
While the correct 1/2-1/2 model will give:
$$
\frac{-500\log{\frac{1}{2}}-500\log{\frac{1}{2}}}{1000} = \frac{1}{2}\log{\frac{8}{2}}
$$
So we can see that the 1/3-2/3 model gives more surprise, which indicates it is worse than the correct model.
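To make this concrete, here is a minimal Python sketch (my own illustration, not from the book) of this average-surprise score:

```python
# Minimal sketch: score a model by the average surprise -log2(p) it assigns
# to each symbol of a real sequence (the empirical per-symbol value).
from math import log2

def avg_surprise(model, sequence):
    return sum(-log2(model[s]) for s in sequence) / len(sequence)

sequence = ["A"] * 500 + ["B"] * 500     # the hypothetical 1000-letter sample
model_1 = {"A": 2/3, "B": 1/3}           # my first guess
model_2 = {"A": 1/2, "B": 1/2}           # the designer's true model

print(avg_surprise(model_1, sequence))   # ~1.085 bits per symbol -> worse
print(avg_surprise(model_2, sequence))   # 1.0 bits per symbol -> better
```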
Only when the sequence is long enough will the average mimic the expectation over the true 1/2-1/2 distribution; if the sequence is short, it won't give a convincing result.
I didn't mention cross-entropy here since I think that jargon is too intimidating and not very helpful in revealing the root cause.
ADD 1
I'd like to share a bit more about my understanding of the surprise mentioned in the last paragraph.
The word surprise here is actually a synonym for information gain. But it is very abstract and subjective to say how much information we gained through some experience, so we need a concrete and objective measure of it. That measure is $-\log(p)$.
Well, among so many mathematical options, why do we choose this particular function? I read about psychophysics, which mentions a logarithmic rule relating physical stimuli to the contents of consciousness, so I think this explains why we choose $\log(p)$. As to why we add a minus sign, I think it's because humans tend to use positive numbers as measures, such as length and area, while probability is a value in $[0, 1]$, which makes the raw $\log(p)$ negative. Also, this design is consistent with our common sense that less probable things give more information/surprise. So that's it.
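As a tiny illustration (my own) of how $-\log(p)$ matches that intuition:

```python
# Tiny check: -log2(p) grows as p shrinks, matching "less probable => more surprise".
from math import log2

for p in (1.0, 0.5, 0.25, 0.01):
    print(p, -log2(p))   # prints 0.0, 1.0, 2.0, ~6.64 bits
```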
To put it simply, we use mathematics to describe/model the world, and mathematics ultimately just reflects our intuition.