From the BatchNorm paper, section 4.2.3 (https://arxiv.org/abs/1502.03167):

> The ensemble prediction was based on the arithmetic average of class probabilities predicted by the constituent networks.

Is there a theoretical basis for doing this? Is the value obtained by averaging the individual probabilities still a valid probability?


#### Best Answer

From the law of total probability we know that for disjoint, exhaustive events $H_n$ (a partition of the sample space), we can write:

$$P(A) = \sum_n P(A \mid H_n) \, P(H_n)$$

Basically, if $P(A \mid H_n)$, $n = 1, \dots, N$, are the probabilities emitted by the different networks, and the $H_n$ form a disjoint hypothesis space, then the result is a probability.

Simple averaging implicitly assumes $P(H_n) = \frac{1}{N}$ for all $n = 1, \dots, N$; that is, a discrete uniform distribution over the networks.

The biggest problem with this kind of average is that nobody really checks whether the hypotheses are in fact disjoint, or whether it makes sense to assign equal probability to each. In practice the hypotheses usually end up very similar to one another. So mathematically speaking the result is still a probability, but from a Bayesian model averaging point of view it is not a well-thought-out prior.
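To make the validity claim concrete, here is a minimal sketch (the probability values are made up for illustration) verifying that a uniform average of valid probability vectors is a convex combination, and hence itself a valid probability vector:

```python
import numpy as np

# Hypothetical softmax outputs from N = 3 ensemble members for one input,
# each row a valid probability distribution over 4 classes.
p = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.50, 0.25, 0.15, 0.10],
])

# Simple averaging = mixture with uniform weights P(H_n) = 1/N.
avg = p.mean(axis=0)

# The average is a convex combination of probability vectors, so every
# entry is non-negative and the entries still sum to 1.
assert np.all(avg >= 0)
assert np.isclose(avg.sum(), 1.0)
print(avg)
```

Any non-negative weights summing to 1 (a non-uniform $P(H_n)$) would preserve validity just as well; uniform weights are simply the default choice when there is no reason to prefer one network over another.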

### Similar Posts:

- Solved – Average value prediction for Artificial Neural Network
- Solved – Ensemble of convolutional neural networks for pattern recognition tasks
- Solved – If the sum of the probabilities of events is equal to the probability of their union, does that imply that the events are disjoint
- Solved – Can a GAN be used for tabular/vector data augmentation
- Solved – Averaging weights learned during backpropogation