I read the answer in "Why are non zero-centered activation functions a problem in backpropagation?" but I still don't understand it.
Assume $$f=\sum_i w_ix_i+b, \qquad \sigma(x)=\dfrac{1}{1+e^{-x}},$$ and the loss function is $$L=\sigma(f).$$
To my understanding, the gradient is $$\dfrac{dL}{dw_i}=\dfrac{dL}{df}x_i=\dfrac{dL}{d\sigma}\dfrac{d\sigma}{df}x_i.$$
So the sign of $\dfrac{dL}{dw_i}$ actually depends on the so-called upstream gradient $\dfrac{dL}{d\sigma}$, since $\dfrac{d\sigma}{df}$ is always positive.
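For what it's worth, here is a quick numeric check of that expression (a minimal numpy sketch with made-up values; since $L=\sigma(f)$ here, the upstream gradient $\dfrac{dL}{d\sigma}$ is just $1$):

```python
import numpy as np

# Quick numeric check of dL/dw_i = dL/dsigma * dsigma/df * x_i.
# Here L = sigma(f) directly, so dL/dsigma = 1 and the formula
# reduces to dL/dw_i = sigma'(f) * x_i. The values below are made up.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])   # example inputs
w = np.array([0.1, 0.4, -0.3])   # example weights
b = 0.2

f = w @ x + b

# Analytic gradient: sigma'(f) = sigma(f) * (1 - sigma(f)) > 0,
# so the sign of each component is just the sign of x_i.
analytic = sigmoid(f) * (1.0 - sigmoid(f)) * x

# Central finite differences for comparison
eps = 1e-6
numeric = np.array([
    (sigmoid((w + eps * e) @ x + b) - sigmoid((w - eps * e) @ x + b)) / (2 * eps)
    for e in np.eye(len(w))
])

print(analytic)
print(numeric)   # matches the analytic gradient to within ~1e-9
```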
So as I understand it, a non-zero-centred output ($\sigma(x)$) of the activation function is not the problem; the problem is the non-zero-centred derivative ($\dfrac{d\sigma}{df}$) of the activation function.
Is there anything wrong?
And what is the mathematical definition of zero-centred? Is it $\int_{-\infty}^{\infty} f(x)\,dx=0$?
Best Answer
The effect arises between two consecutive layers.
Consider a three-layer network:
$$X \Rightarrow h_1 = f(W_1X+b_1) \Rightarrow h_2 = f(W_2h_1+b_2) \Rightarrow L = W_3h_2+b_3,$$
where $f(x)$ is the sigmoid function.
When we optimize the parameter $W_2$, it does not matter whether the input $X$ is zero-centred or not: the input to this layer from the previous layer, $h_1$, is always positive because the previous layer uses the sigmoid as its activation function. Since $\frac{dL}{dW_{2,ij}} = \delta_i\, h_{1,j}$, where $\delta_i$ is the upstream gradient of the $i$-th unit, all the gradients in row $i$ share the sign of $\delta_i$. The weights feeding into a unit can therefore only all increase or all decrease in a single update, which causes a zig-zag path during optimization.
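To see this concretely, here is a small numpy sketch of the three-layer network above (random illustrative shapes and values, scalar loss as written): it computes $\frac{\partial L}{\partial W_2}$ by hand and prints its sign pattern.

```python
import numpy as np

# Sketch: with a sigmoid previous layer, h1 > 0 elementwise, so for any
# output unit i the gradients dL/dW2[i, j] = delta_i * h1[j] all share
# the sign of delta_i (the upstream gradient for that unit).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X  = rng.normal(size=4)                 # zero-centred input; it doesn't matter
W1 = rng.normal(size=(5, 4)); b1 = rng.normal(size=5)
W2 = rng.normal(size=(3, 5)); b2 = rng.normal(size=3)
W3 = rng.normal(size=3);      b3 = rng.normal()

h1 = sigmoid(W1 @ X + b1)               # always in (0, 1): strictly positive
h2 = sigmoid(W2 @ h1 + b2)
L  = W3 @ h2 + b3                       # scalar loss, as in the answer

# Backprop to W2: dL/dW2 = outer(delta, h1)
delta  = W3 * h2 * (1.0 - h2)           # dL/d(pre-activation of layer 2)
dL_dW2 = np.outer(delta, h1)

print(np.sign(dL_dW2))                  # each row is all +1 or all -1
```

Every row of the printed sign matrix is constant, because $h_1>0$ elementwise; all weights feeding into a unit can only move in the same direction in a given update, which is what produces the zig-zag path. With a zero-centred activation such as $\tanh$ in the previous layer, $h_1$ could contain both signs and the rows would be mixed.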