Solved – Why is binary cross entropy (or log loss) used in autoencoders for non-binary data

I am working on an autoencoder for non-binary data ranging in [0,1], and while exploring existing solutions I noticed that many people (e.g., the Keras tutorial on autoencoders, this guy) use binary cross-entropy as the loss function in this scenario. While the autoencoder works, it produces slightly blurry reconstructions, which, among other reasons, might be because binary cross-entropy for non-binary data penalizes errors towards 0 and 1 more than errors towards 0.5 (as nicely explained here).

For example, suppose the true value is 0.2, autoencoder A predicts 0.1, and autoencoder B predicts 0.3. The loss for A would be (using base-10 logarithms)

−(0.2 * log(0.1) + (1−0.2) * log(1−0.1)) = 0.236606

while the loss for B would be

−(0.2 * log(0.3) + (1−0.2) * log(1−0.3)) = .228497317

Hence, B is considered a better reconstruction than A, if I got everything correct. But this does not quite make sense to me, as I am not sure why an asymmetric loss would be preferred over symmetric loss functions like MSE.
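The comparison is easy to reproduce. Below is a minimal sketch using natural logarithms (as most frameworks do, whereas the figures above use base-10 logs); the ordering of the two losses is the same either way:

```python
import math

def bce(target, pred):
    """Binary cross-entropy for a single value, natural log."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

loss_a = bce(0.2, 0.1)  # autoencoder A: prediction below the target
loss_b = bce(0.2, 0.3)  # autoencoder B: prediction above the target

print(loss_a, loss_b)   # B's loss is smaller, even though |error| is the same
```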

In this video Hugo Larochelle argues that the minimum will still be at the point of perfect reconstruction, but the loss will never be zero (which makes sense). This is further explained in this excellent answer, which proves why the minimum of binary cross-entropy for non-binary values that are in [0,1] is given when the prediction equals the true value.
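Both claims are easy to verify numerically. The sketch below scans predictions over a grid and checks that the loss is minimized exactly at the true value, yet does not vanish there:

```python
import math

def bce(target, pred):
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))

target = 0.2
preds = [i / 1000 for i in range(1, 1000)]  # grid over the open interval (0, 1)
best = min(preds, key=lambda p: bce(target, p))

print(best)                # the minimizer is the true value itself
print(bce(target, target)) # ... but the loss at the minimum is strictly positive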

So, my question is: Why is binary cross-entropy used for non-binary values in [0,1], and why is this asymmetric loss acceptable compared to symmetric loss functions like MSE, MAE, etc.? Does it have a better loss landscape, i.e., is it convex while the others are not, or are there other reasons?

Your question inspired me to look at the loss function from the point of view of mathematical analysis. As a disclaimer: my background is in physics, not statistics.

Let's rewrite $-\mathrm{loss}$ as a function of the NN output $x$ and find its derivative:

$ f(x) = a \ln x + (1-a) \ln (1-x) $

$ f'(x) = \frac{a-x}{x(1-x)} $

where $a$ is the target value. Now we put $x = a + \delta$ and, assuming that $\delta$ is small, neglect the terms with $\delta^2$ for clarity:

$ f'(\delta) = \frac{\delta}{a(a-1) + \delta(2a-1)} $

This equation lets us build some intuition for how the loss behaves. When the target value $a$ is (close to) zero or one, the derivative is a constant $-1$ or $+1$. For $a$ around 0.5, the derivative is linear in $\delta$.

In other words, during backpropagation this loss cares more about very bright and very dark pixels, but puts less effort on optimizing middle tones.
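This behaviour can be checked directly against the exact derivative $f'(x) = (a-x)/(x(1-x))$ evaluated at $x = a + \delta$ (a small sketch; `grad` is just a helper name):

```python
def grad(a, delta):
    """Exact derivative f'(x) = (a - x) / (x * (1 - x)) at x = a + delta."""
    x = a + delta
    return (a - x) / (x * (1 - x))

# Target at 0: the gradient stays near -1 however small delta gets.
print(grad(0.0, 0.01), grad(0.0, 0.001))
# Target at 0.5: the gradient is roughly linear in delta (slope about -4).
print(grad(0.5, 0.01), grad(0.5, 0.001))
```

Note that $f$ here is the negative loss as defined above, so the signs match the answer's convention.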

Regarding asymmetry: when the NN is far from the optimum, it probably does not matter, as you will merely converge faster or slower. When the NN is close to the optimum ($\delta$ is small), the asymmetry disappears.
