I recently read a paper by Yann Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", in which they introduce an interesting descent algorithm called Saddle-Free Newton. It seems to be tailored for neural network optimization and shouldn't get stuck at saddle points the way first-order methods such as vanilla SGD can.
The paper dates back to 2014, so it's nothing brand new, but I haven't seen the method used "in the wild". Why is it not being used? Is the Hessian computation too expensive for real-world-sized problems and networks? Is there an open-source implementation of this algorithm, ideally one that works with one of the major deep learning frameworks?
Update (Feb 2019): there is now an implementation available: https://github.com/dave-fernandes/SaddleFreeOptimizer
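For context, the core idea as I understand it is to rescale the gradient by the inverse of |H| (the Hessian with its eigenvalues replaced by their absolute values) instead of by the inverse of H, so that directions of negative curvature push away from a saddle rather than toward it. Below is a minimal sketch of that idea on a toy 2-D function. The objective, the starting point, the `damping` value, and all function names are my own assumptions for illustration, and the toy Hessian is diagonal, so |H| is simply the absolute value of each diagonal entry. In general |H| requires an eigendecomposition, and this sketch says nothing about how to make that tractable for a large network.

```python
# Toy objective (an assumption for illustration, not taken from the paper):
#   f(x, y) = x**2 + y**4 / 4 - y**2 / 2
# It has a saddle point at (0, 0) and minima at (0, 1) and (0, -1).
# Its Hessian is diagonal, so |H| is just the absolute value of each entry.

def grad(x, y):
    return (2.0 * x, y**3 - y)

def hess_diag(x, y):
    # Diagonal entries of the Hessian; the off-diagonal entries are zero.
    return (2.0, 3.0 * y**2 - 1.0)

def newton_step(x, y):
    # Plain Newton: rescale the gradient by H^-1.
    gx, gy = grad(x, y)
    hx, hy = hess_diag(x, y)
    return (x - gx / hx, y - gy / hy)

def saddle_free_step(x, y, damping=1e-3):
    # Rescale the gradient by |H|^-1 instead of H^-1. The small damping
    # term (hypothetical value) keeps the step finite when a curvature
    # is close to zero.
    gx, gy = grad(x, y)
    hx, hy = hess_diag(x, y)
    return (x - gx / (abs(hx) + damping), y - gy / (abs(hy) + damping))

def run(step, start, n_steps=50):
    x, y = start
    for _ in range(n_steps):
        x, y = step(x, y)
    return x, y

start = (1.0, 0.1)  # close to the saddle along the y direction
newton_end = run(newton_step, start)
saddle_free_end = run(saddle_free_step, start)
print("Newton:            ", newton_end)
print("Saddle-free Newton:", saddle_free_end)
```

From this starting point, plain Newton is attracted to the saddle at (0, 0), while the |H|-rescaled step moves away from it along y and settles near the minimum at (0, 1).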
Better optimization does not necessarily mean a better model. In the end, what we care about is how well the model generalizes, not how well it performs on the training set. Fancier optimization techniques usually converge faster and reach a lower training loss, but they do not always generalize as well as basic algorithms. For example, this paper shows that SGD can generalize better than the Adam optimizer. The same can be true of some second-order optimization algorithms.
[Edit] Removed the first point as it does not apply here. Thanks to bayerj for pointing this out.