When optimizing a convex function, doing an update like:

$$w_{t+1} = w_t + c\,\nabla f(w_t)$$

is recommended. Why is moving along $\nabla f(w)$ the fastest way to move closer to the goal? What's the intuition behind this?
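The update rule can be sketched in code. This is a toy example (the objective, step size, and iteration count are mine, not the asker's), using a concave $f$ so that the plus-sign update climbs toward the maximum:

```python
# Minimal sketch of w_{t+1} = w_t + c * grad_f(w_t), assuming the toy
# concave objective f(w) = -(w - 3)**2, whose maximum is at w = 3.
def grad_f(w):
    # Gradient of f(w) = -(w - 3)**2 is -2 * (w - 3).
    return -2.0 * (w - 3.0)

def gradient_ascent(w0, c=0.1, steps=100):
    """Repeatedly apply the update rule from the question."""
    w = w0
    for _ in range(steps):
        w = w + c * grad_f(w)
    return w

w_star = gradient_ascent(w0=0.0)  # converges toward 3.0
```

For a convex function you would *minimize* and subtract the gradient instead; the sign convention here matches the "+" in the question's update.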


#### Best Answer

**Your questions were:**

- Why is moving along $\nabla f(w)$ the fastest way to move closer to the goal?
- What's the intuition around this?

They are not the same question. A solid answer to the second, in my opinion, leads to a better grasp of how the method is used, and of the first question as well. It is also more accessible to someone trying to get the basics down reasonably well.

**My answers are, in reverse order:**

**(Answer 2)** The intuition driving this is "steepest ascent". It is an assumption.

Some insight into moving along the steepest direction can be seen in the terrain figure from the "Topographic prominence" page on Wikipedia.

If you start near the bottom of Atkins Hill and want to take the fewest steps to get to the top, then following the steepest slope is a good approach. Here a "step" means a projection onto the topographic map: it does not account for vertical distance, and it assumes that if you want to be at some place, you can get there. This is fine for functions, but can take work when climbing mountains.

If, on the other hand, you start at the top of Atkins Hill and use gradient ascent, then you never reach Great Pond Mountain. You simply stay where you are: at a peak the gradient is zero, so it gives you no direction to go.

We like the gradient "where you are" on mathematical surfaces because it can be computed or approximated. This is especially useful in settings like neural networks or deep belief networks, where the location of the destination peak isn't knowable when we start on the problem. Finding it isn't easy, but we have computers to evaluate many gradients while searching for the optimal configuration of parameters.

Some ways to handle the weakness of the intuition include:

- Use multiple, uniformly randomly located initial points, and record the location and value of the best result (random restarts).
- Use simulated annealing.
- Use conjugate gradient ascent, which reaches the peak of a (locally) quadratic surface more quickly.
- Use over- or under-relaxation, which can also accelerate convergence.

Like many things in math, and like many things in human experience, it is not a bad initial assumption: it is false but useful, and with time and work you can overcome its weaknesses.

**(Answer 1)** For your question 1, the "why" is more challenging to answer.

It is more symbolically intensive: one must talk about convexity, distance measures, derivatives, and so on.

One must also talk about what "fast" means. The fewest possible steps is zero, and the only way to achieve that is to start at the solution. The second fewest is one: know where the solution is, and take a single step to get there.

There is a rule in optimal control that says "best" means nothing without a rubric. If you say an approach is best, the rubric can often be inferred from the answer. If you provide the problem and the rubric, then the approach that serves that rubric (assuming one exists) can often be determined. We could then show the rubrics under which this approach is "best".

"Fastest" can mean different things. Do you mean the fewest mathematical operations? The shortest time given an infinite supercomputer? "Fast" isn't strongly defined here, and I don't want to presume too much about your question.

Can you clarify what you are looking for in part (1) of your questions?

**Update:**

Using the method of Lagrange multipliers, the extremal values occur at critical points of the surface. If you assume you started "near enough" to one of those critical points, then by moving to points that drive the gradient at your location toward zero, you move toward a critical point. "Near enough", though hand-wavy, usually lets Taylor-series assumptions treat your surface as locally quadratic.
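The locally quadratic picture can be made explicit with a second-order Taylor expansion around a critical point $w^*$ (the notation here is mine, not from the original answer):

$$f(w) \approx f(w^*) + \nabla f(w^*)^{\top}(w - w^*) + \tfrac{1}{2}(w - w^*)^{\top} H (w - w^*)$$

Since $\nabla f(w^*) = 0$ at a critical point, the linear term vanishes, and near $w^*$ the surface is described by the quadratic (Hessian) term alone.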
