I read the following section from cs231n course notes:

> Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. $x > 0$ elementwise in $f = w^Tx + b$), then the gradient on the weights $w$ will during backpropagation become either all positive, or all negative (depending on the gradient of the whole expression $f$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.

I have understood why the gradients with respect to the weights $w$ become all positive or all negative during backpropagation, since

$$\frac{\partial f}{\partial w_j}=x_j \text{ , and } \frac{\partial L}{\partial w_j}=\frac{\partial L}{\partial f}\frac{\partial f}{\partial w_j}=\frac{\partial L}{\partial f}x_j.$$

Thus the gradients of $L$ with respect to the weights are all positive or all negative, depending on the sign of $\frac{\partial L}{\partial f}$.

But I do not understand why this has implications for the dynamics during gradient descent. More precisely, why do we get 'zig-zag' gradient updates if the derivatives with respect to the weights are all positive or all negative? Can you provide some intuition and mathematical arguments to justify this?


#### Best Answer

If the gradients all have the same sign, then all the weights must either increase together or decrease together in a single iteration. So, depending on the step length, if you overshoot in the + direction, all the weights will have to adjust in the − direction at the next time step. I think the idea he is getting at is similar to what you see in steepest descent (see slide 9 of http://www.robots.ox.ac.uk/~az/lectures/opt/lect1.pdf).
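This overshoot-and-correct pattern is easy to see numerically. Below is a minimal sketch (my own toy setup, not from the course notes): a single linear neuron $f = w \cdot x$ with all-positive inputs, trained with a squared loss toward an assumed target output. Because $x > 0$ elementwise, the gradient $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial f}\,x$ has every component sharing the sign of the scalar $\frac{\partial L}{\partial f}$, so each update moves all weights up together or all down together, and with a step size large enough to overshoot, successive updates alternate sign.

```python
import numpy as np

# Toy illustration (assumed setup): single neuron f = w.x, all-positive
# input x, squared loss toward an arbitrary target output f_target.
x = np.array([1.0, 2.0])   # all-positive input (the x > 0 case)
w = np.array([0.0, 0.0])   # weights
f_target = -1.0            # desired neuron output (assumed for the demo)
lr = 0.3                   # step size chosen large enough to overshoot

updates = []
for _ in range(8):
    f = w @ x
    dL_df = f - f_target        # scalar error; fixes the sign of the whole gradient
    grad = dL_df * x            # every component shares the sign of dL_df
    w = w - lr * grad
    updates.append(-lr * grad)  # the step actually taken this iteration

step_signs = [np.sign(u) for u in updates]
# Within each update both components share a sign (all + or all -),
# and successive updates flip sign: the descent path zig-zags.
```

Inspecting `step_signs` shows each step is either (+, +) or (−, −), never (+, −), and the sign flips every iteration as the update overshoots the optimum and corrects back.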
