Can Friedman's gradient boosting machine achieve better performance than Breiman's Random Forest? If so, in which conditions or what kind of data set can make gbm better?
The following provides an explanation as per why Boosting generally outperforms Random Forest in practice, but I would be very interested to know which other different factors may explain Boosting's edge over RF in specific settings.
Basically, within the $error=bias+variance$ framework, RF can only reduce error through reducing the variance (Hastie et al. 2009 p. 588). The bias is fixed and equal to the bias of a single tree in the forest (hence the need to grow very large trees, that have very low bias).
On the other hand, Boosting reduces bias (by adding each new tree in the sequence so that what was missed by the preceding tree is captured), but also variance (by combining many models).
So, Boosting reduces error on both fronts, whereas RF can only reduce error through reducing variance. Of course, as I said, there might be other explanations for the better performance of Boosting observed in practice. For instance, page 591 of the aforementioned book, it is said that Boosting outperforms RF on the nested sphere problem because in that particular case the true decision boundary is additive. (?) They also report that Boosting does better than RF for the spam and the California housing data.
Another reference that found Boosting to outperform RF is Caruana and Niculescu-Mizil 2006. Unfortunately, they report the results but don't try to explain what causes them. They compared the two classifiers (and many more) on 11 binary classification problems for 8 different performance metrics.