# Solved – Why increasing the batch size has the same effect as decaying the learning rate

There have been a few papers this year, concerned with very large-scale training, where instead of decaying the learning rate $$\eta$$, the batch size $$B$$ was increased, usually following the same schedule that would otherwise have been used for $$\eta$$. Why does that work? Intuitively, I would expect smaller batches to result in noisier updates, and thus to have a regularizing effect. Vice versa, large $$B \Rightarrow$$ less noisy updates (the gradient estimate has less variance) $$\Rightarrow$$ less regularization.
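To make the variance claim concrete, here is a quick simulation using synthetic per-sample "gradients" (i.i.d. draws with mean $$\mu$$ and variance $$\sigma^2$$, standing in for the per-example gradients of a real loss): averaging over a minibatch of size $$B$$ shrinks the variance of the estimate to roughly $$\sigma^2 / B$$.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic per-sample "gradients": i.i.d. with mean mu, variance sigma^2.
mu, sigma, n_trials = 1.0, 2.0, 20000

def minibatch_grad(B):
    # The minibatch gradient estimate is the mean of B per-sample gradients.
    return rng.normal(mu, sigma, size=(n_trials, B)).mean(axis=1)

for B in (1, 4, 16, 64):
    est = minibatch_grad(B)
    # Empirical variance is close to sigma**2 / B.
    print(B, est.var())
```

Each line of output shows the variance dropping by roughly the factor that $$B$$ grew by, which is the sense in which large batches mean "less noisy updates".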

Now, in a very hand-wavy way, I would expect regularization in general to make optimization problems easier/more stable (this is certainly true of some kinds of regularization, such as Tikhonov or $$L_2$$ regularization for least squares, i.e., ridge regression). If this is correct, then large batches correspond to less regularization. So why can I increase $$B$$ instead of decreasing $$\eta$$? Shouldn't I decrease $$\eta$$ even more, to compensate for the reduced numerical stability?

Or is this stabilization view completely wrong, and increasing $$B$$ means I can use a larger $$\eta$$ simply because the estimate of $$\nabla_{\mathbf{w}} \mathcal{L}$$ is more accurate, so I can take larger steps in the direction of the (local) minimum?


One way to see it is that taking $$B$$ steps with batch size 1 and learning rate $$\eta$$ should be pretty close to taking a single step with batch size $$B$$ and learning rate $$B\eta$$, assuming the gradient is roughly constant with mean $$\mu$$ over these $$B$$ steps, and that our minibatch gradient estimate has variance $$\frac{\sigma^2}{B}$$.
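This near-equivalence can be checked numerically. Below is a minimal sketch on a toy least-squares problem (my own setup, not from any of the papers): take $$B$$ single-sample SGD steps with learning rate $$\eta$$, and compare against one full-batch step with learning rate $$B\eta$$. The two differ only because the small-batch run re-evaluates the gradient at slightly moved points, an $$O(\eta^2)$$ effect when $$\eta$$ is small.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy least-squares problem: loss(w) = mean_i (x_i . w - y_i)^2 / 2
X = rng.normal(size=(32, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=32)

def grad(w, idx):
    # Minibatch gradient of the least-squares loss on the samples in idx.
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

eta, B = 0.001, 32
w0 = np.zeros(5)

# (a) B steps with batch size 1 and learning rate eta.
w_small = w0.copy()
for i in range(B):
    w_small = w_small - eta * grad(w_small, [i])

# (b) One step with batch size B and learning rate B * eta.
w_large = w0 - (B * eta) * grad(w0, range(B))

# The gap comes only from re-evaluating gradients along the way,
# and is second order in eta.
print(np.max(np.abs(w_small - w_large)))
```

With a small $$\eta$$ the gap is a tiny fraction of the total update, which is exactly the sense in which "B small steps ≈ one big step with learning rate $$B\eta$$".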