Solved – What does Diagonal Rescaling of the gradients mean in ADAM paper

I was reading the original paper on ADAM (Adam: A Method for Stochastic Optimization), which mentions:

[…] invariant to diagonal rescaling of the gradients, […]

What does it mean?
Also, another paper – Normalized Direction-preserving Adam – mentions:

Second, while the magnitudes of Adam parameter updates are invariant to rescaling of the gradient, the effect of the updates on the same overall network function still varies with the magnitudes of parameters.

The original Adam paper briefly explains what it means by "invariant to diagonal rescaling of the gradients" at the end of section 2.1.

I will try to explain it in more detail.

Like stochastic gradient descent (SGD), Adam is an iterative method that uses gradients in order to find a minimum of a function.
(By "gradients" I mean "the values of the gradient in different locations in parameter space". I later use "partial derivatives" in a similar fashion.)

But in contrast to SGD, Adam doesn't really use the gradient as a whole. Instead, Adam uses the partial derivative of each parameter independently.
(By "partial derivative of a parameter $x$" I mean "partial derivative of the cost function $C$ with respect to $x$", i.e. $\frac{\partial C}{\partial x}$.)

Let $\Delta^{(t)}$ be the step that Adam takes in parameter space in the $t^{\text{th}}$ iteration. Then the step it takes in the dimension of the $j^{\text{th}}$ parameter (in the $t^{\text{th}}$ iteration) is $\Delta^{(t)}_j$, which is given by: $$\Delta^{(t)}_j=-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot {\hat m}^{(t)}_j$$ where:

  • $\alpha$ is the learning rate hyperparameter.
  • $\epsilon$ is a small hyperparameter to prevent division by zero.
  • ${\hat m}^{(t)}_j$ is an exponential moving average of the partial derivatives of the $j^{\text{th}}$ parameter that were calculated in iterations $1$ to $t$.
  • ${\hat v}^{(t)}_j$ is an exponential moving average of the squares of the partial derivatives of the $j^{\text{th}}$ parameter that were calculated in iterations $1$ to $t$.
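
To make the formula concrete, here is a minimal Python sketch of the step for a single parameter. The decay rates `beta1` and `beta2` and the bias-correction terms are not spelled out above; I took them, together with the default values, from Algorithm 1 of the Adam paper. The function name `adam_steps` and the sample partial derivatives are made up for illustration.

```python
import math

def adam_steps(partials, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the steps Delta_j^(t) for one parameter j.

    `partials` holds the partial derivatives seen in iterations 1..T.
    """
    m = 0.0  # moving average of the partial derivatives (not yet bias-corrected)
    v = 0.0  # moving average of the squared partial derivatives
    steps = []
    for t, g in enumerate(partials, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        steps.append(-alpha / (math.sqrt(v_hat) + eps) * m_hat)
    return steps

# Made-up partial derivatives for three iterations.
print(adam_steps([0.5, -0.2, 0.1]))
```

Note that only the history of this one parameter's partial derivatives enters the computation; the other parameters play no role.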

Now, what happens when we scale the partial derivative of the $j^{\text{th}}$ parameter by a positive factor $c$?
(I.e. the partial derivative of the $j^{\text{th}}$ parameter is just a function whose domain is the parameter space, so we can simply multiply its value by $c$.)

  • ${\hat m}^{(t)}_j$ becomes $c\cdot{\hat m}^{(t)}_j$
  • ${\hat v}^{(t)}_j$ becomes $c^2\cdot{\hat v}^{(t)}_j$
  • Thus (using the fact that $c>0$), we get that $\Delta^{(t)}_j$ becomes: $$-\frac{\alpha}{\sqrt{c^2\cdot{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j=-\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j$$ And assuming $\epsilon$ is very small, we get: $$\begin{gathered} -\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot c\cdot{\hat m}^{(t)}_j\approx -\frac{\alpha}{c\cdot\sqrt{{\hat v}^{(t)}_j}}\cdot c\cdot{\hat m}^{(t)}_j=\\ -\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}}\cdot{\hat m}^{(t)}_j\approx-\frac{\alpha}{\sqrt{{\hat v}^{(t)}_j}+\epsilon}\cdot{\hat m}^{(t)}_j \end{gathered}$$

I.e. scaling the partial derivative of the $j^{\text{th}}$ parameter by a positive factor $c$ doesn't affect $\Delta^{(t)}_j$ (up to the negligible effect of $\epsilon$).
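
The derivation can be checked numerically. The sketch below is self-contained and makes some assumptions: the decay rates and defaults are the ones from the Adam paper, and the partial derivatives and the helper `max_relative_change` are made up. It also shows where the "assuming $\epsilon$ is very small" step stops holding: when $c$ is so tiny that $c\cdot\sqrt{{\hat v}^{(t)}_j}$ is no longer large compared to $\epsilon$.

```python
import math

def adam_steps(partials, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Per-parameter Adam steps for a history of partial derivatives.
    m = v = 0.0
    steps = []
    for t, g in enumerate(partials, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        steps.append(-alpha / (math.sqrt(v_hat) + eps) * m_hat)
    return steps

partials = [0.5, -0.2, 0.1, 0.4]  # made-up partial derivatives
baseline = adam_steps(partials)

def max_relative_change(c):
    # Largest relative difference between scaled and unscaled steps.
    scaled = adam_steps([c * g for g in partials])
    return max(abs(s - b) / abs(b) for s, b in zip(scaled, baseline))

# Moderate factor: the steps are practically unchanged.
print("c = 1000 :", max_relative_change(1000.0))
# Extremely small factor: epsilon dominates the denominator and the steps shrink.
print("c = 1e-10:", max_relative_change(1e-10))
```

So the invariance is exact only in the limit $\epsilon \to 0$; for ordinary gradient magnitudes the difference is far below anything that matters in practice.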

Finally, let $g=\left(\begin{gathered}g_{1}\\ g_{2}\\ \vdots \end{gathered} \right)$ be the gradient. Then $g_j$ is the partial derivative of the $j^{\text{th}}$ parameter.

What happens when we multiply the gradient by a diagonal matrix with only positive elements? $$\left(\begin{matrix}c_{1}\\ & c_{2}\\ & & \ddots \end{matrix}\right)g=\left(\begin{matrix}c_{1}\\ & c_{2}\\ & & \ddots \end{matrix}\right)\left(\begin{gathered}g_{1}\\ g_{2}\\ \vdots \end{gathered} \right)=\left(\begin{gathered}c_{1}\cdot g_{1}\\ c_{2}\cdot g_{2}\\ \vdots \end{gathered} \right)$$ So it would only scale each partial derivative by a positive factor, but as we have seen above, this won't affect the steps that Adam takes.

In other words, Adam is invariant to multiplying the gradient by a diagonal matrix whose diagonal entries are all positive, which is what the paper means by "invariant to diagonal rescaling of the gradients".
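
For comparison, plain SGD does not have this property, since its step is just the gradient times $-\alpha$. Here is a small sketch that applies the same diagonal rescaling to both. It assumes the Adam paper's default decay rates; the gradient vectors, the diagonal entries, and the helper names `adam_step`, `sgd_step` and `rescale` are made up for illustration.

```python
import math

def adam_step(grad_history, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Latest Adam step for every parameter, given a list of gradient vectors."""
    n = len(grad_history[0])
    m = [0.0] * n
    v = [0.0] * n
    for g in grad_history:
        m = [beta1 * mj + (1 - beta1) * gj for mj, gj in zip(m, g)]
        v = [beta2 * vj + (1 - beta2) * gj * gj for vj, gj in zip(v, g)]
    t = len(grad_history)
    return [
        -alpha * (mj / (1 - beta1 ** t)) / (math.sqrt(vj / (1 - beta2 ** t)) + eps)
        for mj, vj in zip(m, v)
    ]

def sgd_step(grad, alpha=0.001):
    return [-alpha * gj for gj in grad]

def rescale(grad, diag):
    # Multiplying by a diagonal matrix scales each component separately.
    return [c * gj for c, gj in zip(diag, grad)]

grads = [[0.5, -2.0, 0.01], [0.3, -1.5, 0.02]]  # made-up gradient vectors
diag = [10.0, 0.1, 250.0]                        # positive diagonal entries

adam_plain = adam_step(grads)
adam_scaled = adam_step([rescale(g, diag) for g in grads])
sgd_plain = sgd_step(grads[-1])
sgd_scaled = sgd_step(rescale(grads[-1], diag))

print("Adam:", adam_plain, "vs", adam_scaled)  # practically identical
print("SGD: ", sgd_plain, "vs", sgd_scaled)    # each component scaled by c_j
```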

With regard to the quote from the paper Normalized Direction-preserving Adam, it describes the "ill-conditioning problem". (This is how the paper names the problem. Note that it is a different problem from the problem of an ill-conditioned Hessian.)
It seems to me that this problem is not specific to Adam (and is unrelated to the fact that Adam is invariant to rescaling of the gradient). I deduced that mostly from two other quotes in the paper that elaborate on the ill-conditioning problem:

  1. Furthermore, even when batch normalization is not used, a network using linear rectifiers (e.g., ReLU, leaky ReLU) as activation functions, is still subject to ill-conditioning of the parameterization (Glorot et al., 2011), and hence the same problem. We refer to this problem as the ill-conditioning problem.

    The quote refers to the paper Deep Sparse Rectifier Neural Networks, which never mentions Adam, and also describes the problem of "ill-conditioning of the parametrization", which seems to me very similar (if not identical) to the "ill-conditioning problem".

  2. The ill-conditioning problem occurs when the magnitude change of an input weight vector can be compensated by other parameters, such as the scaling factor of batch normalization, or the output weight vector, without affecting the overall network function. Consequently, suppose we have two DNNs that parameterize the same function, but with some of the input weight vectors having different magnitudes, applying the same SGD or Adam update rule will, in general, change the network functions in different ways. Thus, the ill-conditioning problem makes the training process inconsistent and difficult to control.

    If I understand correctly, this quote says that both SGD and Adam suffer from the ill-conditioning problem. I.e. the problem isn't unique to Adam.
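
Here is a toy sketch of what I understand the second quote to describe. Everything in it is my own assumption rather than something taken from either paper: a one-hidden-unit ReLU network, a squared-error cost, a single training point, and the names `net`, `grads`, `sgd_update` and `adam_first_update`. Two parameter settings that compute the same function are given one update with the same rule, and the resulting functions are then compared.

```python
def relu(z):
    return max(0.0, z)

def net(w_in, w_out, x):
    # Tiny one-hidden-unit ReLU network (hypothetical, for illustration only).
    return w_out * relu(w_in * x)

def grads(w_in, w_out, x, y):
    # Partial derivatives of C = 0.5 * (net - y)^2, assuming w_in * x > 0.
    err = net(w_in, w_out, x) - y
    return err * w_out * x, err * relu(w_in * x)

def sgd_update(params, g, alpha):
    return tuple(p - alpha * gj for p, gj in zip(params, g))

def adam_first_update(params, g, alpha, eps=1e-8):
    # At t = 1 the bias-corrected averages are m_hat = g and v_hat = g^2.
    return tuple(p - alpha * gj / (abs(gj) + eps) for p, gj in zip(params, g))

x, y, alpha, k = 1.0, 1.0, 0.1, 10.0
net_a = (1.0, 2.0)          # (w_in, w_out)
net_b = (k * 1.0, 2.0 / k)  # same function: the factor k cancels for k > 0

results = {}
for name, update in [("sgd", sgd_update), ("adam", adam_first_update)]:
    outs = []
    for params in (net_a, net_b):
        g = grads(*params, x, y)
        new_params = update(params, g, alpha)
        outs.append(net(*new_params, x))
    results[name] = outs
    print(name, "outputs after one update:", outs)
```

Before the update both networks give the same output for every input. After one update the outputs differ for SGD as well as for Adam, which fits the reading that the problem isn't unique to Adam.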
