Solved – Why does zero-centered output affect backpropagation?

I read the answer to "Why are non zero-centered activation functions a problem in backpropagation?" but I still don't understand.

Assume $$f=\sum w_ix_i+b$$ $$\sigma(x)=\dfrac{1}{1+e^{-x}}$$ and the loss function is $$L=\sigma(f)$$

To my understanding, the gradient is $$\dfrac{dL}{dw_i}=\dfrac{dL}{df}x_i=\dfrac{dL}{d\sigma}\dfrac{d\sigma}{df}x_i$$
So the sign of $\dfrac{dL}{dw_i}$ actually depends on the so-called upstream gradient $\dfrac{dL}{d\sigma}$ (and on $x_i$), since $\dfrac{d\sigma}{df}$ is always positive.
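
As a side note on why $\dfrac{d\sigma}{df}$ is always positive: the sigmoid derivative has a well-known closed form, and writing it out makes the sign argument explicit. This is a standard identity added here for illustration, not something taken from the linked answer.

$$\dfrac{d\sigma}{df}=\sigma(f)\bigl(1-\sigma(f)\bigr)>0 \quad\Longrightarrow\quad \operatorname{sign}\!\left(\dfrac{dL}{dw_i}\right)=\operatorname{sign}\!\left(\dfrac{dL}{d\sigma}\right)\operatorname{sign}(x_i)$$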

So, as I understand it, a non-zero-centred output ($\sigma(x)$) of the activation function is not the problem; the problem is the non-zero-centred derivative ($\dfrac{d\sigma}{df}$) of the activation function.

Is there anything wrong?

And what's the mathematical expression of zero-centred? Is it $\int_{-\infty}^{\infty} f(x)\,dx=0$?

The effect arises between two layers.

Consider a three-layer network:

$$X \Rightarrow h_1 = f(W_1X+b_1) \Rightarrow h_2 = f(W_2h_1+b_2) \Rightarrow L = W_3h_2+b_3$$

where $f(x)$ is the sigmoid function.

When we optimize the parameter $W_2$, the input to this layer from the previous layer, $h_1$, is always positive, no matter whether the input $X$ is zero-centred or not, because the previous layer uses the sigmoid function as its activation. Since $\frac{dL}{dW_{2,ij}} = \frac{dL}{dz_{2,i}}h_{1,j}$ with $z_2 = W_2h_1+b_2$, the gradients $\frac{dL}{dW_{2,ij}}$ for all weights feeding a given unit $i$ have the same sign, which causes a zig-zag path during optimization.
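
The same-sign effect can be checked numerically. The following is a minimal NumPy sketch of the three-layer network above, assuming a scalar output, small random weights, and arbitrary layer sizes; the helper `grad_W2` and all shapes are illustrative choices, not part of the original answer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_W2(X, W1, b1, W2, b2, W3):
    """Return h1 and dL/dW2 for L = W3 @ h2 + b3 (b3 does not affect this gradient)."""
    h1 = sigmoid(W1 @ X + b1)                 # always in (0, 1)
    h2 = sigmoid(W2 @ h1 + b2)
    # dL/dz2 = dL/dh2 * dh2/dz2, with dL/dh2 = W3 and dh2/dz2 = h2 * (1 - h2)
    delta2 = W3.ravel() * h2 * (1.0 - h2)     # shape (n2,)
    # dL/dW2[i, j] = delta2[i] * h1[j]
    return h1, np.outer(delta2, h1)           # shape (n2, n1)

# Hypothetical sizes and random parameters, chosen only for the demonstration.
rng = np.random.default_rng(0)
n_in, n1, n2 = 4, 5, 3
X = rng.normal(size=n_in)                     # input may be zero-centred
W1, b1 = rng.normal(size=(n1, n_in)), rng.normal(size=n1)
W2, b2 = rng.normal(size=(n2, n1)), rng.normal(size=n2)
W3 = rng.normal(size=(1, n2))

h1, dW2 = grad_W2(X, W1, b1, W2, b2, W3)
signs = np.sign(dW2)
# Every entry in a row is compared with the first entry of that row.
rows_share_sign = bool(np.all(signs == signs[:, :1]))

print("h1 all positive:", bool(h1.min() > 0))
print("each row of dL/dW2 has a single sign:", rows_share_sign)
```

Each row of `dW2` holds the gradients for the weights feeding one unit of the second layer, so a single sign per row means those weights can only all increase or all decrease together in one gradient step.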
