Solved – Understanding the sigmoid activation function as the last layer in a network

I have two versions of a CNN that differ only in a final sigmoid layer:

  1. CNN | last two layers: CONV + SIGMOID
  2. CNN | last layer: CONV

The range of my ground truth values is [0, 1].

The loss function I use is the L2 loss.
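
For reference, here is a minimal sketch of the two setups as I understand them, written in PyTorch (the framework, layer sizes, and input shapes are my own assumptions for illustration, not taken from the original post):

```python
import torch
import torch.nn as nn

# Shared convolutional trunk (sizes are illustrative placeholders).
trunk = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

# Version 1: last two layers are CONV + SIGMOID.
head_with_sigmoid = nn.Sequential(
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
    nn.Sigmoid(),
)

# Version 2: last layer is CONV only, raw (unbounded) outputs.
head_linear = nn.Sequential(
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
)

# L2 loss against ground truth in [0, 1].
criterion = nn.MSELoss()

x = torch.randn(4, 3, 32, 32)        # dummy input batch
target = torch.rand(4, 1, 32, 32)    # dummy ground truth in [0, 1]

loss1 = criterion(head_with_sigmoid(trunk(x)), target)
loss2 = criterion(head_linear(trunk(x)), target)
print(loss1.item(), loss2.item())
```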

When I train both networks the second one outperforms the first one by far.

For example:

  1. Network 1 (CONV + SIGMOID): loss = 230 at the beginning, loss = 23 after 3 epochs.
  2. Network 2 (CONV only): loss = 18 at the beginning, loss = 4 after 100 iterations.

I do not understand why the version with the sigmoid never gets anywhere near the performance of the version without it. I have been reading up on this, and some people say that the L2 loss does not go well with a sigmoid output, which can supposedly be shown mathematically. Even so, I would understand some difference in the loss, but this difference is huge.

I would guess that two things are at work here. First, your initialization seems to hurt the sigmoid version more than the linear output layer. Maybe your network's raw outputs start out around 0.5: the sigmoid maps that to roughly 0.62, biasing predictions above the middle of your target range, whereas for your other network 0.5 is already a good value for targets in [0, 1].
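
Here is a toy check of that guess, assuming (purely for illustration) that the raw outputs really do start near 0.5 and that the targets are uniform on [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

targets = rng.uniform(0.0, 1.0, size=10_000)   # hypothetical ground truth in [0, 1]
raw = np.full_like(targets, 0.5)                # hypothetical raw outputs at initialization

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

loss_linear = np.mean((raw - targets) ** 2)            # linear head keeps 0.5
loss_sigmoid = np.mean((sigmoid(raw) - targets) ** 2)  # sigmoid maps 0.5 -> ~0.62

print(f"linear head:  {loss_linear:.4f}")   # ~0.083
print(f"sigmoid head: {loss_sigmoid:.4f}")  # ~0.098
```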

The second problem is, in my opinion, the learning rate. The derivative of the sigmoid s(x) is s(x)(1 – s(x)), which is at most 0.25 and thus much smaller than the constant gradient of 1 for a linear output. With a correspondingly higher learning rate, the loss of the sigmoid version should decrease in a similar fashion.
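
To make the scale difference concrete, here is a small numerical sketch (my own illustration, not from the original post) comparing the gradient of the L2 loss with respect to the pre-activation z for a sigmoid head versus a linear head; the target value y = 0.2 is an arbitrary choice:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)   # pre-activation values
y = 0.2                         # some target in [0, 1]

# Linear head: L = (z - y)^2, so dL/dz = 2 * (z - y)
grad_linear = 2.0 * (z - y)

# Sigmoid head: L = (s(z) - y)^2, so dL/dz = 2 * (s(z) - y) * s(z) * (1 - s(z))
s = sigmoid(z)
grad_sigmoid = 2.0 * (s - y) * s * (1.0 - s)

# The sigmoid gradient is damped by s(z)(1 - s(z)) <= 0.25 and nearly
# vanishes once the sigmoid saturates, so with the same learning rate the
# sigmoid network takes much smaller steps.
for zi, gl, gs in zip(z, grad_linear, grad_sigmoid):
    print(f"z = {zi:+.1f}   linear grad = {gl:+.3f}   sigmoid grad = {gs:+.3f}")
```

The damping factor is largest at z = 0 (where it equals 0.25) and decays quickly as the sigmoid saturates, which is why the sigmoid version needs a higher learning rate or more training before it catches up.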

In the end, the results should be similar if you train both networks until convergence.
