There have been a few papers this year, concerned with very large scale training, where instead of decaying the learning rate $\eta$, the batch size $B$ is increased, usually following the same schedule that would otherwise have been used for $\eta$. Why does that work? Intuitively, I would expect smaller batches to result in noisier updates, and thus to have a regularizing effect. Conversely, large $B \Rightarrow$ less noisy updates (the gradient estimate has lower variance) $\Rightarrow$ less regularization.
Now, in a very hand-wavy way, I would expect regularization in general to make optimization problems easier/more stable (this is certainly true of some kinds of regularization, such as Tikhonov regularization or $L_2$ regularization for least squares, i.e., ridge regression). If this is correct, then large batches correspond to less regularization. So why can I increase $B$ instead of decreasing $\eta$? Shouldn't I decrease $\eta$ even more, in order to compensate for the reduced numerical stability?
Or is this stabilization view completely wrong, and increasing $B$ means that I can use a larger $\eta$ simply because the estimate of $\nabla_{\mathbf{w}} \mathcal{L}$ is more accurate, thus I can take larger step sizes in the direction of the (local) minimum?
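(As a toy check of the noise claim above, not part of the original question: the sketch below draws synthetic per-example "gradients" with variance $\sigma^2$ and confirms that the minibatch average has variance roughly $\sigma^2/B$. The value $\sigma = 2$ and the batch sizes are arbitrary, purely for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 2.0         # per-example gradient noise level (illustrative value)
n_trials = 100_000  # number of minibatches sampled per batch size

for B in (1, 8, 64):
    # Each row is one minibatch of per-example "gradients": pure noise with std sigma.
    grads = rng.normal(loc=0.0, scale=sigma, size=(n_trials, B))
    minibatch_grad = grads.mean(axis=1)  # the minibatch gradient estimate
    print(f"B={B:3d}  empirical var={minibatch_grad.var():.4f}  "
          f"theoretical sigma^2/B={sigma**2 / B:.4f}")
```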
Best Answer
I think this may be a confusion between two different meanings of stability: stable as in numerically stable (the weights don't blow up to infinity), versus stable as in the loss decreases steadily and you eventually converge to some good solution.
$L_2$ regularization accomplishes the first, but doesn't necessarily help with the second. A small batch size isn't necessarily stable in the first sense and is unstable in the second sense. A large batch size also isn't necessarily stable in the first sense, but it is stable in the second.
In terms of selecting the batch size / learning rate for large-scale training, we're concerned more with the second sense of stability.
One way to see it is that if you take $B$ steps with batch size 1 and learning rate $\eta$, it should be pretty close to taking a single step with batch size $B$ and learning rate $B\eta$, assuming the gradient is roughly constant with mean $\mu$ over these $B$ steps and our minibatch gradient estimate has variance $\frac{\sigma^2}{B}$.
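To spell that out (my notation, not from the original answer): let $g_1, \dots, g_B$ be single-example gradients with mean $\mu$ and variance $\sigma^2$, assumed roughly constant over the $B$ steps. Then
$$
\underbrace{w_{t+B} \approx w_t - \eta \sum_{i=1}^{B} g_i}_{B \text{ steps, batch size } 1,\ \text{learning rate } \eta}
\qquad\text{vs.}\qquad
\underbrace{w_{t+1} = w_t - B\eta \cdot \frac{1}{B}\sum_{i=1}^{B} g_i = w_t - \eta \sum_{i=1}^{B} g_i}_{\text{one step, batch size } B,\ \text{learning rate } B\eta}
$$
Both updates have expectation $-B\eta\mu$ and variance $B\eta^2\sigma^2$ (for the single large-batch step: $(B\eta)^2 \cdot \frac{\sigma^2}{B} = B\eta^2\sigma^2$), so the two trajectories match in both signal and noise. Roughly speaking, the size of the stochastic fluctuations relative to the expected progress is governed by the ratio $\eta/B$, which is why decaying $\eta$ and growing $B$ with the same schedule have a similar effect.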
Similar Posts:
- Solved – Sum or average of gradients in (mini) batch gradient descent?
- Solved – Does Keras SGD optimizer implement batch, mini-batch, or stochastic gradient descent
- Solved – Why scale cost functions by 1/n in a neural network
- Solved – Stochastic gradient descent Vs Mini-batch size 1