I've come across something called a dropout method that involves setting a threshold parameter $p$ and then, for each predictor in your training set, generating a uniform random number; if that uniform random number is less than $p$, the predictor is dropped. Basically, for each predictor in the training sample, set the predictor's values to 0 with probability $p$. Do this for $B$ trials to create $B$ separate training sets.
Fit decision tree models $\hat{h}^1(x), \ldots, \hat{h}^B(x) \in \{0,1\}$ to the $B$ training sets.
Combine the decision tree models into a single classifier by taking a majority vote:
$$
\hat{H}_{maj}(x) \,=\, \text{majority}\Big(\hat{h}^1(x), \ldots, \hat{h}^B(x)\Big).
$$
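In code, the procedure I have in mind looks roughly like this (a minimal sketch using numpy and scikit-learn decision trees; `B`, `p`, and the helper names are just placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_dropout_ensemble(X, y, B=100, p=0.3, seed=0):
    """Fit B trees, each on a copy of X with predictors zeroed out w.p. p."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        # Drop each predictor (column) independently with probability p.
        keep = rng.uniform(size=X.shape[1]) >= p
        X_b = X * keep                     # dropped columns become all zeros
        trees.append(DecisionTreeClassifier().fit(X_b, y))
    return trees

def predict_majority(trees, X):
    votes = np.stack([t.predict(X) for t in trees])  # shape (B, n_samples)
    # Majority vote across the B trees (labels are 0/1 as above).
    return (votes.mean(axis=0) > 0.5).astype(int)
```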
I am wondering how this is different from Random Forests. Are they the same? Thanks.
Best Answer
They are similar at a high level, but the details are quite different.
The main difference is in how the results are combined.
In a random forest, each split sees a random subspace of $X$ — some variables are candidates for a split, and some are not. This is different from what you wrote above — you implied that this happens for each tree. Instead, it happens at each split in each tree. This helps ensure that the different trees in the forest are not too heavily correlated with one another, as they see the same data "from a different perspective", and so they find the answer by a different route.
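To make that concrete, here is a rough scikit-learn illustration on synthetic data (not your procedure): the `max_features` argument is what restricts the set of candidate predictors at each individual split, not per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# "sqrt" means roughly sqrt(20) ~ 4 randomly chosen candidate features
# are considered at every individual split in every tree.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)
```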
Random forests are also parallelizable, since each tree is independent of the rest. This is not true of neural net backpropagation, which is inherently serial. With dropout, at each iteration the input $X$ is subset to a random subspace of itself, while the hidden nodes above $X$ are also set to a random subspace of themselves. The network is thus "thinned" (the original dropout paper has a figure showing a full network next to a thinned one).
Note that thinning the nodes implies thinning the parameters that connect variables to nodes and nodes to nodes. At each iteration of gradient descent, a different set of nodes drops out, and updates are performed only to the weights that aren't thinned as a consequence of dropout.
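To make the mechanics concrete, here is a toy numpy sketch of a single training iteration with dropout on one hidden layer (squared-error loss; the names, shapes, and learning rate are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5          # probability that a hidden unit is retained

def train_step(x, y, W1, W2, lr=0.1):
    # Forward pass with a fresh thinning mask for this iteration.
    h = np.maximum(0.0, W1 @ x)              # hidden activations (ReLU)
    mask = rng.random(h.shape) < p_keep      # True = keep unit, False = drop it
    h_thin = h * mask                        # dropped units output zero
    y_hat = W2 @ h_thin                      # network output
    err = y_hat - y                          # gradient of squared-error loss

    # Backward pass: gradients flow only through the surviving units, so
    # weights attached to dropped units receive no update this iteration.
    grad_W2 = np.outer(err, h_thin)
    grad_h = (W2.T @ err) * mask * (h > 0)   # mask zeroes the dropped units
    grad_W1 = np.outer(grad_h, x)

    W1 -= lr * grad_W1
    W2 -= lr * grad_W2
    return W1, W2

# Example usage with made-up shapes: 4 inputs, 8 hidden units, 1 output.
x = rng.normal(size=4)
y = np.array([1.0])
W1 = rng.normal(size=(8, 4)) * 0.1
W2 = rng.normal(size=(1, 8)) * 0.1
W1, W2 = train_step(x, y, W1, W2)
```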
When the trees of a random forest are combined, it is different estimates of $\hat y$ from the different trees that are combined.
With dropout, there are no different estimates of the weights. There is just one set, though each weight has only been updated in roughly $\text{iter} \times p$ of the iterations, where $p$ is the probability that a unit is retained (i.e., not dropped out).
At test time, the weights are simply multiplied by $p$, and the final network is formed with those shrunken weights. The idea is that we are taking an expectation over the thinned networks, thinking of each backprop step as training a different net in some sense.
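Here is a minimal sketch of that test-time step in the same toy numpy setting as above (again, `p_keep` is the retention probability and the names are just illustrative):

```python
import numpy as np

def predict(x, W1, W2, p_keep=0.5):
    # Every hidden unit is active at test time; the outgoing weights of the
    # dropped-out layer are scaled by the retention probability p_keep.
    h = np.maximum(0.0, W1 @ x)
    return (W2 * p_keep) @ h
```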
That final multiplication step never sat well with me; it seems to spit upon Jensen's inequality. But it works pretty well in practice, and it is also far cheaper than actually training and averaging many separate thinned networks.
Here is the original dropout paper.