I have two versions of a CNN that differ only by a sigmoid layer.
- CNN 1 | last two layers: CONV + SIGMOID
- CNN 2 | last layer: CONV
My ground truth values lie in the range [0, 1].
The loss function I use is the L2 loss.
When I train both networks, the second one outperforms the first one by far.
For example:
- First network, at the beginning: loss = 230
- First network, after 3 epochs: loss = 23
- Second network, at the beginning: loss = 18
- Second network, after 100 iterations: loss = 4
I do not understand why the version with the SIGMOID never gets near the solution reached without it. I have been reading up on this, and some people say that the L2 loss does not go well with the SIGMOID, which can be proven mathematically. I would understand some difference in the loss, but this difference is huge.
Best Answer
I would guess that there are two things at work here. First, your initialization seems to perform worse for the sigmoid output than for the linear output layer. Maybe the raw convolution output starts out close to your targets, which is pretty good for the linear network, while the sigmoid squashes it to a different value (a raw output of 0.5, for example, becomes about 0.62).
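To make the initialization point concrete, here is a minimal sketch in plain Python. The raw outputs and targets are made-up numbers, not taken from the question; they assume that the convolution output starts near 0 and that most targets are small. Under that assumption the linear head starts with a much lower L2 loss than the sigmoid head, because the sigmoid maps values near 0 to about 0.5.

```python
import math

def sigmoid(x):
    """Logistic sigmoid s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def l2_loss(preds, targets):
    """Sum of squared errors between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets))

# Hypothetical values (assumption): raw conv outputs right after
# initialization, and ground truth values in [0, 1].
raw_outputs = [0.05, -0.02, 0.10, 0.00]
targets = [0.00, 0.00, 0.10, 0.05]

# Linear head: the raw outputs are the predictions.
linear_loss = l2_loss(raw_outputs, targets)

# Sigmoid head: the same raw outputs are squashed first.
sigmoid_loss = l2_loss([sigmoid(z) for z in raw_outputs], targets)

print(f"initial loss, linear head:  {linear_loss:.4f}")
print(f"initial loss, sigmoid head: {sigmoid_loss:.4f}")
```

With different targets (for example, targets clustered around 0.5) the comparison could come out the other way, so this only illustrates the guess above, it does not confirm it for your data.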
The second problem is, in my opinion, the learning rate. The gradient of the sigmoid s(x) is s(x)(1 - s(x)), which is at most 0.25 and therefore quite small compared to the gradient of 1 for a linear output. With a higher learning rate, the loss should decrease in a similar fashion.
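As a rough check of the gradient argument, the sketch below evaluates s(x)(1 - s(x)) at a few points and compares the gradient of a squared error with and without the sigmoid. The function names are my own, and a single scalar output stands in for the whole network, so treat it as an illustration rather than your actual setup.

```python
import math

def sigmoid(x):
    """Logistic sigmoid s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def l2_grad_linear(z, target):
    """d/dz of (z - target)^2 for a linear output."""
    return 2.0 * (z - target)

def l2_grad_sigmoid(z, target):
    """d/dz of (s(z) - target)^2, by the chain rule."""
    s = sigmoid(z)
    return 2.0 * (s - target) * s * (1.0 - s)

# The sigmoid derivative peaks at 0.25 (at x = 0) and shrinks
# quickly once the unit saturates.
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"x = {z:+.1f}  s'(x) = {sigmoid_grad(z):.4f}")

# Same prediction error of 0.3 in both cases, but the gradient
# through the sigmoid is scaled down by s'(z) <= 0.25.
print(l2_grad_linear(0.8, 0.5))       # prediction 0.8, target 0.5
print(l2_grad_sigmoid(0.0, 0.2))      # prediction s(0) = 0.5, target 0.2
```

For the same prediction error, the gradient that reaches the convolution weights is at least four times smaller with the sigmoid, which is why a larger learning rate may be needed to see a comparable decrease in the loss.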
In the end, the results should be similar if you train both networks until convergence.