I read the answer to "Why are non zero-centered activation functions a problem in backpropagation?" but I still don't understand it.

Assume $$f=\sum w_ix_i+b$$ $$\sigma(x)=\dfrac{1}{1+e^{-x}}$$, and the loss function is $$L=\sigma(f)$$

As I understand it, the gradient is $$\dfrac{dL}{dw_i}=\dfrac{dL}{df}x_i=\dfrac{dL}{d\sigma}\dfrac{d\sigma}{df}x_i$$

So $\dfrac{dL}{dw_i}$ actually depends on the so-called *upstream gradient* $\dfrac{dL}{d\sigma}$, **since $\dfrac{d\sigma}{df}$ is always positive**.
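The chain rule above can be checked numerically. The following is a minimal sketch in plain Python, not part of the original question: the helper names (`loss`, `analytic_grad`, `numeric_grad`), the `upstream` parameter, and the sample values of `w`, `x`, and `b` are all assumptions chosen for illustration. It compares $\dfrac{dL}{d\sigma}\dfrac{d\sigma}{df}x_i$ with a central finite difference, using $\dfrac{dL}{d\sigma}=1$ because $L=\sigma(f)$ here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, b):
    # L = sigma(f) with f = sum_i w_i * x_i + b, as defined in the question
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(f)


def analytic_grad(w, x, b, upstream=1.0):
    # dL/dw_i = (dL/dsigma) * (dsigma/df) * x_i
    # `upstream` stands for dL/dsigma; it is 1 here because L = sigma(f).
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    s = sigmoid(f)
    dsigma_df = s * (1.0 - s)  # strictly positive for finite f
    return [upstream * dsigma_df * xi for xi in x]


def numeric_grad(w, x, b, eps=1e-6):
    # Central finite difference, one weight at a time
    grads = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grads.append((loss(w_plus, x, b) - loss(w_minus, x, b)) / (2 * eps))
    return grads


# Illustrative values (assumed, not from the question); x has mixed signs
w = [0.3, -0.7, 0.5]
x = [1.5, -2.0, 0.25]
b = 0.1

ga = analytic_grad(w, x, b)
gn = numeric_grad(w, x, b)
for i, (a, n) in enumerate(zip(ga, gn)):
    print(f"dL/dw_{i}: analytic={a:.6f} numeric={n:.6f} x_{i}={x[i]}")
```

With a positive upstream gradient, each printed gradient component should carry the sign of the corresponding $x_i$, since $\dfrac{d\sigma}{df}$ only rescales it.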

So, as I understand it, a non-zero-centred **output** ($\sigma(x)$) of the activation function is not the problem; the problem is the non-zero-centred **derivative** ($\dfrac{d\sigma}{df}$) of the activation function.

Is there anything wrong with this reasoning?

And what's the mathematical expression of *zero-centred*? Is $\int_{-\infty}^{\infty} f(x)\,dx=0$?


#### Best Answer

The effect arises between two consecutive layers.

Consider a three-layer network:

$$X \Rightarrow h_1 = f(W_1X+b_1) \Rightarrow h_2 = f(W_2h_1+b_2) \Rightarrow L = W_3h_2+b_3$$

where $f(x)$ is the sigmoid function.

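When we optimize the parameter $W_2$, it does not matter whether the input $X$ is zero-centred or not: the input this layer receives from the previous layer, $h_1$, is always positive, because the previous layer uses the sigmoid as its activation function. Each $\frac{dL}{dW_{2,ij}}$ is the upstream gradient for output unit $i$ multiplied by the positive value $h_{1,j}$, so for a fixed $i$ the gradients $\frac{dL}{dW_{2,ij}}$ all have the same sign across $j$, which causes a zig-zag path during optimization.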
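To make this concrete, here is a minimal sketch in plain Python of the three-layer network above. The layer sizes, the random weights, the sample input `X`, and the helper names (`matvec`, `grad_W2`, `rows_share_sign`) are assumptions made for illustration only. It computes $\frac{dL}{dW_{2,ij}}$ by hand for a scalar $L$ and checks whether every row of that gradient has a single sign, even though `X` has mixed signs.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x, b):
    # Computes W x + b for a list-of-rows matrix W
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def grad_W2(X, W1, b1, W2, b2, W3):
    # Forward pass: h1 = f(W1 X + b1), h2 = f(W2 h1 + b2), L = W3 h2 + b3
    h1 = [sigmoid(z) for z in matvec(W1, X, b1)]
    h2 = [sigmoid(z) for z in matvec(W2, h1, b2)]
    # Upstream gradient for unit i of layer 2: W3_i * sigma'(z2_i)
    delta = [w3 * h * (1.0 - h) for w3, h in zip(W3, h2)]
    # dL/dW2_ij = delta_i * h1_j, and every h1_j is positive
    return [[d * h for h in h1] for d in delta]


def rows_share_sign(G):
    # True if each row is entirely positive or entirely negative
    return all(
        all(g > 0 for g in row) or all(g < 0 for g in row) for row in G
    )


random.seed(0)
n_in, n_h1, n_h2 = 4, 5, 3  # assumed layer sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


W1, b1 = rand_matrix(n_h1, n_in), [0.0] * n_h1
W2, b2 = rand_matrix(n_h2, n_h1), [0.0] * n_h2
W3 = [random.uniform(-1, 1) for _ in range(n_h2)]  # row vector, scalar L

X = [-1.0, 2.0, -3.0, 0.5]  # input with mixed signs
G = grad_W2(X, W1, b1, W2, b2, W3)
for i, row in enumerate(G):
    print(i, ["+" if g > 0 else "-" for g in row])
print(rows_share_sign(G))
```

Under these assumptions, each printed row should consist of a single repeated sign, so a gradient step can only increase or decrease all incoming weights of a given unit together. That restriction on the update direction is what the zig-zag path refers to.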
