Solved – Information gain is KL divergence

I am slightly confused by the statement that the Kullback–Leibler divergence is the same as information gain in decision trees.

I cannot understand how $D_{KL}(P||Q) = H(P,Q) - H(P)$ can be represented as $IG(T,a) = H(T) - H(T|a)$.

I would appreciate an explanation.

$Q$ is "wrong" distribution, while $P$ is the "right" distribution. The length of a code $i$ under the "wrong" coding is $-log(q_{i})$. So, the average message length under the "wrong" coding is:

$-\sum_{i} p_{i} \log(q_{i}) = H(P,Q)$

Under the "right" coding, the average message length is $H(P)$. Imagine that we are using entropy coding like arithmetic coding, but we have not estimated the probability distribution right. Then the average message length is a bit larger then the theoretical limit $H(P)$. The difference is Kullback–Leibler divergence.
