The ID3 algorithm uses the "Information Gain" measure.

C4.5 uses the "Gain Ratio" measure, which is Information Gain divided by SplitInfo, where SplitInfo is high for a split whose records are divided evenly between the different outcomes and low otherwise.

My question is:

How does this help to solve the problem that Information Gain is biased towards splits with many outcomes? I can't see the reason. SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split.

It may very well be that there is a low number of outcomes (say 2), and the records are split evenly between those 2 outcomes. In that case, SplitInfo is high, Gain Ratio is low, and a split with few outcomes is less likely to be chosen by C4.5.

On the other hand, it may be that there is a low number of outcomes, but the distribution is far from even. In that case, SplitInfo is low, Gain Ratio is high, and a split with many outcomes is more likely to be chosen.

What am I missing?

#### Best Answer

> SplitInfo doesn't even take into account the number of outcomes, just the distribution of records in the split.

But it **does** take the number of outcomes into account (even if it is *also* dependent on the distribution, as you noted). Your comparison is between two situations with the same ("low") number of outcomes, so it can't possibly illustrate how `SplitInfo` changes as the number of outcomes changes.

Consider the following 3 situations, all with even distribution for simplicity of comparison:

10 possible outcomes with even distribution

`SplitInfo = -10*(1/10*log2(1/10)) = 3.32`

100 possible outcomes with even distribution

`SplitInfo = -100*(1/100*log2(1/100)) = 6.64`

1000 possible outcomes with even distribution

`SplitInfo = -1000*(1/1000*log2(1/1000)) = 9.97`

So if you had to choose between these 3 splitting scenarios using only `Information Gain`, as in ID3, the last one would be chosen. However, with `SplitInfo` as the denominator of `GainRatio`, it should be clear that as the number of outcomes goes **up**, `SplitInfo` also goes up, and for a given `Information Gain` the `GainRatio` goes **down**.
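As a rough check of the numbers above, here is a minimal Python sketch. The helper names `split_info` and `gain_ratio`, the 1000-record dataset, and the fixed information gain of 1.0 bit are all assumptions made for illustration; they are not taken from C4.5's actual implementation.

```python
import math

def split_info(branch_sizes):
    """Entropy (in bits) of the branch sizes produced by a split."""
    total = sum(branch_sizes)
    return -sum((n / total) * math.log2(n / total)
                for n in branch_sizes if n > 0)

def gain_ratio(info_gain, branch_sizes):
    """Information gain normalised by the split's SplitInfo."""
    return info_gain / split_info(branch_sizes)

# Hypothetical setup: 1000 records divided evenly into k branches,
# with the same assumed information gain (1.0 bit) for every candidate.
ASSUMED_GAIN = 1.0
for k in (10, 100, 1000):
    sizes = [1000 // k] * k
    print(k, round(split_info(sizes), 2),
          round(gain_ratio(ASSUMED_GAIN, sizes), 3))
```

With the gain held fixed, the printed `SplitInfo` values match the 3.32, 6.64 and 9.97 computed by hand, and the gain ratio shrinks as the number of branches grows.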

All of that was explained under the assumption of evenly distributed splits. However, even with a non-uniform distribution, the above still holds: `SplitInfo` tends to get higher as the number of possible outcomes gets higher. Yes, if we hold the number of possible outcomes constant and vary the outcome distribution, then `SplitInfo` will vary somewhat… but so will `Information Gain`.
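To illustrate the non-uniform case, the sketch below uses one assumed family of skewed splits: a single branch receives half of the records and the remaining half is spread evenly over the other branches. The `skewed` helper and this particular distribution are hypothetical choices for the example. The trend is not guaranteed for arbitrary pairs of distributions: a heavily skewed many-way split can still have a lower `SplitInfo` than an even two-way split.

```python
import math

def split_info(proportions):
    """Entropy (in bits) of the fraction of records sent to each branch."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

def skewed(k):
    """Hypothetical skewed split: one branch holds half of the records,
    the other k - 1 branches share the remaining half evenly."""
    return [0.5] + [0.5 / (k - 1)] * (k - 1)

# SplitInfo still grows with the number of outcomes for this family.
for k in (2, 11, 101):
    print(k, round(split_info(skewed(k)), 2))
```

For this family `SplitInfo` works out to `1 + 0.5 * log2(k - 1)`, so it keeps increasing with the number of outcomes `k` even though the split is far from even.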