# Solved – Why zero-centered output affects backpropagation

I read the answer to "Why are non zero-centered activation functions a problem in backpropagation?" but I still don't understand it.

Assume $$f=\sum_i w_ix_i+b$$ $$\sigma(x)=\dfrac{1}{1+e^{-x}}$$ and the loss function is $$L=\sigma(f)$$

As I understand it, the gradient is $$\dfrac{dL}{dw_i}=\dfrac{dL}{df}x_i=\dfrac{dL}{d\sigma}\dfrac{d\sigma}{df}x_i$$
So the sign of $\dfrac{dL}{dw_i}$ actually depends on the so-called upstream gradient $\dfrac{dL}{d\sigma}$, since $\dfrac{d\sigma}{df}$ is always positive.

So, as I understand it, a non-zero-centred output ($\sigma(x)$) of the activation function is not the problem; the problem is the non-zero-centred derivative ($\dfrac{d\sigma}{df}$) of the activation function.

Is there anything wrong?

And what is the mathematical expression of zero-centred? Is it $\int_{-\infty}^{\infty} f(x)\,dx=0$?


The effect shows up between two consecutive layers.

Consider a three-layer network:

$$X \Rightarrow h_1 = f(W_1X+b_1) \Rightarrow h_2 = f(W_2h_1+b_2) \Rightarrow L = W_3h_2+b_3$$

where $f(x)$ is the sigmoid function.

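When we optimize the parameter $W_2$, it does not matter whether the input $X$ is zero-centred or not: the input this layer receives from the previous layer, $h_1$, is always positive, because the previous layer uses the sigmoid as its activation function. Writing $z_2 = W_2h_1+b_2$ for the pre-activation of the second layer, the chain rule gives $\frac{dL}{dW_{2,ij}} = \frac{dL}{dz_{2,i}}\,h_{1,j}$. Since every $h_{1,j}>0$, all the gradients $\frac{dL}{dW_{2,ij}}$ for a fixed unit $i$ have the same sign as $\frac{dL}{dz_{2,i}}$. In a single gradient step, the weights feeding unit $i$ can therefore only all increase or all decrease together, which causes a zig-zag path during optimization. So it is the non-zero-centred output of the previous layer, not the derivative of the activation, that creates the problem.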
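The same-sign claim can be checked numerically. The following is only a minimal sketch under assumed conditions: the layer sizes, the random seed, and the names `delta2` and `grad_W2` are illustrative choices that do not come from the question or the answer. It builds the three-layer network above with NumPy, backpropagates by hand to $W_2$, and prints the sign of each gradient entry.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical sizes (assumption): 4 inputs, 5 units in h1, 3 units in h2, scalar L.
X = rng.normal(size=(4, 1))  # zero-centred input with mixed signs
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5, 1))
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=(3, 1))
W3, b3 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))

# Forward pass: X => h1 => h2 => L
h1 = sigmoid(W1 @ X + b1)    # every entry lies strictly between 0 and 1
z2 = W2 @ h1 + b2
h2 = sigmoid(z2)
L = W3 @ h2 + b3

# Backward pass to W2.
# dL/dz2_i = W3_i * sigmoid'(z2_i), where sigmoid'(z) = s(z) * (1 - s(z)) > 0.
delta2 = W3.T * h2 * (1.0 - h2)   # shape (3, 1)
# dL/dW2_ij = delta2_i * h1_j, i.e. an outer product.
grad_W2 = delta2 @ h1.T           # shape (3, 5)

# Each row corresponds to one unit i of the second layer.
print(np.sign(grad_W2))
print(np.sign(delta2).ravel())
```

Each row of the printed sign matrix should be constant and match the sign of the corresponding entry of `delta2`, even though `X` itself has mixed signs. Replacing the first-layer `sigmoid` with `np.tanh` gives an `h1` whose entries can take both signs, so the entries within a row are no longer forced to agree.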
