# Solved – Model ensembling – averaging of probabilities

From the Batch Normalization paper, section 4.2.3 (https://arxiv.org/abs/1502.03167):

> The ensemble prediction was based on the arithmetic average of class probabilities predicted by the constituent networks.

Is there a theoretical basis for doing this? Is the output value, after averaging the individual probabilities, still a valid probability?


From the law of total probability we know that for disjoint events $$H_n$$ that partition the sample space, we can calculate:

$$P(A) = \sum_n P(A \mid H_n) \, P(H_n)$$

Basically, if $$P(A \mid H_n)$$, $$n = 1, \ldots, N$$, are the probabilities emitted by the different networks, and the $$H_n$$ form a disjoint hypothesis space, then the result is itself a valid probability.

When doing simple averaging, they are assuming that $$P(H_n) = \frac{1}{N}$$ for all $$n = 1, \ldots, N$$; a discrete uniform distribution over the hypotheses.
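The uniform averaging above can be checked numerically. The sketch below uses made-up softmax outputs (the network values are illustrative, not from the paper) and verifies that the uniform average of valid probability vectors is again a valid probability vector:

```python
import numpy as np

# Hypothetical softmax outputs from N = 3 constituent networks
# for one input over 4 classes (each row is non-negative and sums to 1).
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.50, 0.25, 0.15, 0.10],
])

# Simple averaging corresponds to uniform weights P(H_n) = 1/N:
# P(A) = sum_n P(A | H_n) * (1/N), i.e. the column-wise mean.
ensemble = probs.mean(axis=0)

# A convex combination of probability vectors is itself a
# probability vector: entries stay non-negative and sum to 1.
print(ensemble)
print(ensemble.sum())  # 1.0
```

The same holds for any non-uniform weights $$P(H_n)$$, as long as they are non-negative and sum to one (e.g. `np.average(probs, axis=0, weights=w)`).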

The biggest problem with this kind of averaging is that nobody really checks whether the hypotheses are in fact disjoint, or whether it makes sense to assign them equal probabilities. In practice, the hypotheses usually end up being very similar to each other. Mathematically speaking, the result is still a probability, but from a Bayesian model averaging point of view, it is not a well-thought-out prior.
