In this publication I found an explanation of the Hessian matrix and of what it means for it to be ill-conditioned. The paper draws this link between the error surface and the eigenvalues of the Hessian matrix:
The curvature of the error surface is given by the eigenvalues
$\lambda_i$ of the Hessian matrix.
This gives me a bit of a hint as to why it might be important to care whether the Hessian is poorly conditioned, but I'm not quite there yet: I have trouble seeing the consequences of an ill-conditioned Hessian.
So my question is: could you give me some intuitive understanding of why we should care? In particular, in which models can it cause problems, and how?
Best Answer
It is easiest to understand by considering the linear problem
$$ Ax = b $$
where $b$ and $A$ are the problem data and $x$ the parameters we are trying to estimate. In practice, errors in $b$ propagate through $A$. How? Assume we only have errors in the measurements $b$, and denote by $\delta b$ and $\delta x$ the errors in the measurements and in the estimate, respectively. Because of linearity,
$$ \delta b = A\,\delta x. $$
To see how much the measurement errors are magnified by the matrix $A$, look at the ratio
$$ \frac{\|\delta x\|}{\|x\|} \Big/ \frac{\|\delta b\|}{\|b\|}. $$
This number is bounded by the condition number of $A$,
$$ \mathrm{cond}(A) = \frac{\sigma_{1}}{\sigma_{n}}, $$
where $\sigma_{1}$ and $\sigma_{n}$ are the largest and smallest singular values of $A$, respectively (for a symmetric matrix such as the Hessian, these are the magnitudes of its eigenvalues). Hence, the larger the condition number, the more the errors are magnified.
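To make the magnification concrete, here is a minimal NumPy sketch (my own illustration, not from the answer's source): the matrix $A$, its eigenvalues, and the perturbation size are all made up for the example, and the observed magnification is bounded above by $\mathrm{cond}(A)$.

```python
# Sketch: perturb b in Ax = b and compare the relative error in x
# with the relative error in b. The ratio is at most cond(A).
import numpy as np

rng = np.random.default_rng(0)

# Made-up ill-conditioned symmetric matrix with eigenvalues 1 and 1e-6
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))
A = Q @ np.diag([1.0, 1e-6]) @ Q.T

x_true = np.array([1.0, 1.0])
b = A @ x_true

# Small perturbation of the measurements b
delta_b = 1e-8 * rng.standard_normal(2)
x_noisy = np.linalg.solve(A, b + delta_b)

rel_err_x = np.linalg.norm(x_noisy - x_true) / np.linalg.norm(x_true)
rel_err_b = np.linalg.norm(delta_b) / np.linalg.norm(b)

print("cond(A)             :", np.linalg.cond(A))
print("error magnification :", rel_err_x / rel_err_b)  # large, and at most cond(A)
```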
Here, a small eigenvalue of the Hessian corresponds to a direction in which the error surface is nearly flat and the gradient is small, while a large eigenvalue corresponds to a steep direction. When the condition number is large, gradient descent makes very slow progress along the flat directions while it tends to oscillate along the steep ones, which leads to slow convergence.
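A small gradient-descent sketch of that behaviour (again my own illustration, with a made-up diagonal Hessian of condition number 100):

```python
# Plain gradient descent on a quadratic loss f(x) = 0.5 * x^T H x
# whose Hessian H has condition number 100. The minimum is at the origin.
import numpy as np

H = np.diag([100.0, 1.0])   # large curvature along axis 0, small along axis 1
x = np.array([1.0, 1.0])    # starting point

lr = 0.019                  # must be below 2/100 or the steep direction diverges
for step in range(50):
    grad = H @ x            # gradient of the quadratic
    x = x - lr * grad       # the steep coordinate flips sign each step (oscillation)

print(x)  # steep coordinate ~0.005, flat coordinate still ~0.38 after 50 steps
```

The step size is dictated by the largest eigenvalue, so progress along the direction with the smallest eigenvalue is painfully slow; that gap is exactly the condition number.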
This issue has motivated a lot of research on the optimization of neural networks (as you already point out), which has led to techniques such as momentum (see On the importance of initialization and momentum in deep learning) and early stopping. This blog entry provides a very nice description of the topic.