I am slightly confused by the statement that Kullback–Leibler divergence is the same as [information gain](Information gain in decision trees).

I cannot understand how $D_{KL}(P||Q) = H(P,Q) - H(P)$ can be represented as $IG(T,a) = H(T) - H(T|a)$.

I would appreciate an explanation.


#### Best Answer

$Q$ is the "wrong" distribution, while $P$ is the "right" distribution. The length of the code for symbol $i$ under the "wrong" coding is $-\log(q_i)$. So the average message length under the "wrong" coding is:

$$-\sum_i p_i \log(q_i) = H(P,Q)$$

Under the "right" coding, the average message length is $H(P)$. Imagine that we are using entropy coding, such as arithmetic coding, but we have not estimated the probability distribution correctly. Then the average message length is a bit larger than the theoretical limit $H(P)$. The difference is the Kullback–Leibler divergence.
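To make the identity $D_{KL}(P||Q) = H(P,Q) - H(P)$ concrete, here is a small numerical sketch. The two distributions below are made up purely for illustration; the entropies are computed in bits (base-2 logarithm), matching the coding interpretation above.

```python
import numpy as np

# Hypothetical distributions over 3 symbols (chosen for illustration only).
p = np.array([0.5, 0.25, 0.25])  # the "right" distribution
q = np.array([0.25, 0.25, 0.5])  # the "wrong" distribution we coded with

# Entropy of P: average message length under the optimal ("right") code, in bits.
H_p = -np.sum(p * np.log2(p))

# Cross-entropy H(P,Q): average message length when symbols occur with
# probabilities p but code lengths are -log2(q_i), i.e. the "wrong" coding.
H_pq = -np.sum(p * np.log2(q))

# KL divergence computed directly from its definition.
D_kl = np.sum(p * np.log2(p / q))

print(H_p, H_pq, D_kl)
# The extra average length paid for the wrong code equals the KL divergence:
# D_kl == H_pq - H_p
```

For these particular numbers, $H(P) = 1.5$ bits and $H(P,Q) = 1.75$ bits, so the mis-estimated code costs $0.25$ extra bits per symbol on average, which is exactly $D_{KL}(P||Q)$.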

### Similar Posts:

- Solved – Why is Kullback Leibler Divergence always positive
- Solved – KLDIV Kullback-Leibler or Jensen-Shannon divergence between two distributions
- Solved – Disadvantages of the Kullback-Leibler divergence