There are two popular R packages for building the random forests introduced by Breiman (2001): randomForest and randomForestSRC. I am noticing small yet significant discrepancies in accuracy between the two packages, even when I try to use the same input parameters. I understand that we would expect slightly different random forests, but in the example below the randomForestSRC package consistently outperforms the randomForest package. I'm guessing there are other examples where randomForest is superior. Can someone please explain why these packages provide different predictions? Is there a way to generate a random forest with both packages using the same methodology?
In the example, there's no missing data, all values are distinct, mtry=1, and trees are grown until nodesize=5. I believe the same bootstrap approach and split rule are used too. Increasing ntree or the number of observations in the simulated dataset does not change the relative difference between the two packages.
    library(randomForest)
    library(randomForestSRC)

    set.seed(130948) # Other seeds give similar comparative results
    x1 <- runif(1000)
    y <- rnorm(1000, mean = x1, sd = .3)
    data <- data.frame(x1 = x1, y = y)

    # Compare MSE using OOB samples based on output
    (modRF <- randomForest(y ~ x1, data = data, ntree = 500, nodesize = 5))
    (modRFSRC <- rfsrc(y ~ x1, data = data, ntree = 500, nodesize = 5))

    # Compare MSE using a test sample
    x1new <- runif(10000)
    ynew <- rnorm(10000, mean = x1new, sd = .3)
    newdata <- data.frame(x1 = x1new, y = ynew)
    mean((predict(modRF, newdata = newdata) - newdata$y)^2) # MSE using randomForest
    mean((predict(modRFSRC, newdata = newdata)$predicted - newdata$y)^2) # MSE using randomForestSRC
One reason the packages produce different results is the way nodesize is implemented internally. In randomForest, the value appears to be a strict lower bound on terminal node size. In randomForestSRC, while we (unfortunately) don't document the subtlety, we will not attempt to split a node unless it contains at least 2 * nodesize replicates. But when we do split, one daughter can end up with fewer than nodesize replicates while the other has at least nodesize. What we can say is that, "on average", our terminal nodes across the forest will be of size nodesize. The result is that we can grow slightly better trees than randomForest with the "same" setting.
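To make the difference between the two nodesize rules described above concrete, here is a toy sketch in base R. The helper names `admissible_strict` and `admissible_parent` are hypothetical, and the functions only mimic the rules as worded in this answer. They are not taken from either package's source code.

```r
# Toy illustration only: these helpers are hypothetical and follow the rules
# as described in the text, not the packages' actual implementations.

# Rule A ("strict lower bound"): a split is allowed only if both daughters
# keep at least `nodesize` cases.
admissible_strict <- function(n_left, n_parent, nodesize) {
  n_right <- n_parent - n_left
  n_left >= nodesize && n_right >= nodesize
}

# Rule B (parent-size rule): the parent needs at least 2 * nodesize cases,
# but each daughter only needs to be non-empty.
admissible_parent <- function(n_left, n_parent, nodesize) {
  n_right <- n_parent - n_left
  n_parent >= 2 * nodesize && n_left >= 1 && n_right >= 1
}

nodesize <- 5
n_parent <- 12
left_sizes <- seq_len(n_parent - 1) # every possible size of the left daughter

strict_ok <- vapply(left_sizes, admissible_strict, logical(1),
                    n_parent = n_parent, nodesize = nodesize)
parent_ok <- vapply(left_sizes, admissible_parent, logical(1),
                    n_parent = n_parent, nodesize = nodesize)

# Side-by-side view of which left/right sizes each rule would accept
data.frame(n_left = left_sizes,
           n_right = n_parent - left_sizes,
           strict = strict_ok,
           parent = parent_ok)
```

Under these assumptions, the parent-size rule accepts uneven splits that the strict rule rejects, which is why the two settings are not equivalent even when nodesize has the same value.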
If you set nodesize = 1 to avoid this issue, and account for Monte Carlo effects by growing multiple forests over multiple simulations, you will find that the MSEs for the two packages coincide.
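A minimal sketch of that check, reusing the simulation from the question: it grows one forest per package on each of several simulated data sets with nodesize = 1 and summarizes the test MSEs. The helpers `sim_data` and `one_rep` and the choice of 20 replications are assumptions made for illustration, not part of either package.

```r
library(randomForest)
library(randomForestSRC)

# Hypothetical helper: simulate one data set from the question's model
sim_data <- function(n) {
  x1 <- runif(n)
  data.frame(x1 = x1, y = rnorm(n, mean = x1, sd = 0.3))
}

# Hypothetical helper: fit both forests on one training set and
# return the test-set MSE for each package
one_rep <- function(seed, n_train = 1000, n_test = 10000, ntree = 500) {
  set.seed(seed) # makes the simulated data reproducible
  train <- sim_data(n_train)
  test <- sim_data(n_test)
  rf <- randomForest(y ~ x1, data = train, ntree = ntree, nodesize = 1)
  rfs <- rfsrc(y ~ x1, data = train, ntree = ntree, nodesize = 1)
  c(randomForest = mean((predict(rf, newdata = test) - test$y)^2),
    randomForestSRC = mean((predict(rfs, newdata = test)$predicted - test$y)^2))
}

n_reps <- 20 # arbitrary; more replications reduce Monte Carlo noise
mse <- t(sapply(seq_len(n_reps), one_rep))

colMeans(mse)     # average test MSE per package
apply(mse, 2, sd) # spread across replications
```

Comparing the difference in means against the spread across replications gives a rough sense of whether any remaining gap is more than Monte Carlo noise.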