I do not fully understand how second-order optimization approaches help machine learning algorithms, such as a multilayer perceptron, reach the global minimum error. As you know, Stochastic Gradient Descent belongs to the first-order optimization family, as it optimizes the error function by going downhill toward the global minimum. On the other hand, second-order optimization such as L-BFGS is an alternative approach that relies on the second-order derivative of the target function. I learned in calculus that the second derivative tells you the convexity/concavity of the function. How does that help for machine learning? In other words, does it mean that if I know whether the function is concave then I can tell where the global minimum error is?
In other words, does it mean that if I know whether the function is concave then I can tell where the global minimum error is?
Yes, you got the gist of it. Taking the first two derivatives of the function is effectively a way to fit a parabola to the loss function around the current point. Once you know the parabola's equation, you locate its minimum and jump right there, hoping that the true minimum is somewhere in the vicinity. You keep repeating the procedure until you're close enough to a minimum.
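To make the "fit a parabola and jump to its minimum" loop concrete, here is a minimal one-dimensional sketch. The toy loss f(x) = exp(x) - 2x and the names `newton_minimize`, `loss_grad` and `loss_hess` are my own illustrative assumptions, not something taken from a particular library. In one dimension the fitted parabola at x has its minimum at x - f'(x)/f''(x), which is all the update does.

```python
import math


def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Repeatedly fit a parabola at x and jump to that parabola's minimum."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)  # slope of the loss at x
        h = hess(x)  # curvature of the loss at x
        if h <= 0:
            # The fitted parabola opens downward (or is flat): it has no minimum.
            raise ValueError("non-positive curvature at x = %r" % x)
        step = g / h
        x -= step  # jump to the minimum of the fitted parabola
        if abs(step) < tol:
            break
    return x


# Hypothetical toy loss f(x) = exp(x) - 2x, whose minimum is at x = ln(2).
def loss_grad(x):
    return math.exp(x) - 2.0


def loss_hess(x):
    return math.exp(x)


x_min = newton_minimize(loss_grad, loss_hess, x0=0.0)
print(x_min, math.log(2.0))
```

The check on the sign of the curvature is where the convexity question comes in: the jump only makes sense when the second derivative is positive at the current point, and even then it leads to a nearby minimum, not necessarily the global one.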
There are variations of second-order approaches, of course. For instance, you might know the exact second derivative (the Hessian), or you might take a couple of steps and numerically estimate the curvature from them, which is the idea behind quasi-Newton methods such as L-BFGS.
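The "take a couple of steps and estimate the curvature" variant can be sketched in one dimension as well. This is only an illustration of the idea (a secant-style estimate from two gradient evaluations), not the actual L-BFGS algorithm; the function name `secant_minimize` and the toy gradient are assumptions of mine.

```python
import math


def secant_minimize(grad, x0, x1, tol=1e-10, max_iter=100):
    """Newton-like steps where the curvature is estimated from two gradients."""
    x_prev, x = x0, x1
    g_prev, g = grad(x_prev), grad(x)
    for _ in range(max_iter):
        dx = x - x_prev
        if abs(dx) < tol:
            break  # the last two points coincide: converged
        # Finite-difference estimate of the second derivative.
        curvature = (g - g_prev) / dx
        if curvature <= 0:
            raise ValueError("non-positive curvature estimate near x = %r" % x)
        x_prev, g_prev = x, g
        x = x_prev - g_prev / curvature  # same jump as Newton, estimated curvature
        g = grad(x)
    return x


# Same hypothetical toy loss as before: f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2.
def toy_grad(x):
    return math.exp(x) - 2.0


x_est = secant_minimize(toy_grad, x0=0.0, x1=1.0)
print(x_est, math.log(2.0))
```

Only gradients are evaluated here; the second derivative is never computed explicitly, which is what makes this family of methods attractive when the exact Hessian is expensive.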