I am new to evolutionary algorithm. I have studied Covariance Matrix Adaptation Evolution Strategy. I am not good at statistics. So could you please explain me in simple language (I mean not too many equations)
- What is CMA-ES?
- How does it work?
- Why is it superior to other strategies?
Best Answer
Per the wikipedia page linked above to answer (1) this is another form of gradient descent (which if you need more information with lots of pictures there are many articles available if you google it — sorry apparently new posters only get 2 urls so I'm having to have to tell you to search instead of pointing you to specific ones) so it is used for optimization of an objective function. Usually this means I want to find the maximum or minimum value for a function I'm interested in.
Not to split hairs with the other poster, but how statistical this algorithm is a question of semantics. It is proposed to converge (for our considerations that is to say this method works) based on a maximum likelihood argument and covariance (variance in higher dimension — the spread of the sample in the space) is used as part of the update rule for this algorithm to determine how best to proceed for the next iteration. Also new points are sampled from a multivariate normal distribution. Which is a probability model. Similarly as the wikipedia points out this method is similar to, but not identical to, principal component analysis. So from my perspective it seems pretty statistical, but again that is perhaps subjective.
I suppose the second wiki link and parts of the above address (2), but perhaps not adequately so if you need more information please feel free to follow up. The short of it is if there is an optimum this procedure searches for it in the problem space and will eventually find it under the model assumptions. You need to test for the assumptions, because deviation from them is likely a major consequence to and how well this method will work on your problem.
I do however agree with the other poster that you may find more if not better answers on a site like Meta Optimize which is a community like this one specific to machine learning, but many Statisticians do work on learning problems so I would expect others will weigh in as well.
As for (3) it is not necessarily a better method. It is a method for a different set of assumptions than other methods (such as feedforward neural network with gradient descent for example) specifically this method is for non-linear or non-convex problems. Again you need to test if the problem you are working on is appropriate for this method before proceeding to use it. Otherwise you may get a less than optimal solution or worse yet garbage in garbage out. If your problem is non-linear or non-convex then go for it, otherwise you may need to look into using a different method. There is a part of the wikipedia article "Performance in Practice" may help you determine whether this is an appropriate method for the problem you are working on.
You alluded to some mathphobia when not wanting to see too many expressions. So you may not know some properties of the data/problem you're working with. Are you willing to share some more specific details or are you just asking in general?
Hopefully this helps. Please let me know if there are any follow ups. Good luck.