I'm reading section 8.8 of Elements of Statistical Learning, and even after rereading the part on calculating the ensemble weights, I'm still missing something.
It says that the stacking weights are given by
$\hat{w}^{st} = \underset{w}{\operatorname{argmin}} \sum_{i=1}^{N} \left[ y_i - \sum_{m=1}^{M} w_m \hat{f}_m^{-i}(x_i)\right]^2$
a regression where $w_m$ is the weight for model $m$, and $\hat{f}^{-i}_m(x)$ is the prediction at $x$ from model $m$ fit on the dataset with the $i$th observation removed.
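As I read it, if you collect those leave-one-out predictions into an $N \times M$ matrix $P$ with entries $P_{im} = \hat{f}_m^{-i}(x_i)$ (a symbol I'm introducing here, not from the book), this reduces to ordinary least squares of $y$ on $P$: $\hat{w}^{st} = \underset{w}{\operatorname{argmin}} \, \lVert y - Pw \rVert^2$.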
How does one compute this in practice? If you fit a different $f_m()$ for each case of the leave-one-out, does the final ensemble require yet another $f_m()$ that's fit on all the data? If you only have one $f_m()$, do you simply concatenate all $n \times n$ rows of predictions and use that as the data matrix for the regression that finds the $w$?
Best Answer
I'm not 100% sure, but for the sake of posterity here's the solution I came to:
- Fit each model $f_1 \dots f_M$ on the full training set
- Assuming $n$ observations, refit $f_1 \dots f_M$ $n$ times, each time leaving out a different observation from the training set
- Use each leave-one-out fit $\hat{f}_m^{-i}$ to predict at its held-out observation $x_i$. Collecting these gives an $n \times M$ matrix of out-of-sample predictions on the training set, one column per model
- Regress the training responses $y$ on this prediction matrix via OLS. The resulting coefficients are the model weights $\hat{w}$. (The authors suggest that learning techniques besides OLS are suitable here as well.)
- The final stacked model takes the fits $f_1 \dots f_M$ from step 1 and combines their predictions using the corresponding weights $\hat{w}_m$.
- Test on the holdout set. (A code sketch of these steps is below.)
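Here's a minimal sketch of those steps in Python, assuming scikit-learn-style base models. The toy data, the two base learners, and the helper name `stacked_predict` are all made up for illustration:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Toy training data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# The base learners f_1 ... f_M (any regressors with fit/predict work)
base_models = [LinearRegression(), DecisionTreeRegressor(max_depth=3)]
n, M = len(y), len(base_models)

# Step 1: fit each model on the full training set (used for final predictions)
full_fits = [clone(m).fit(X, y) for m in base_models]

# Steps 2-3: build the n x M matrix P with P[i, m] = f_m^{-i}(x_i),
# i.e. model m's prediction at x_i when fit without observation i
P = np.empty((n, M))
for i in range(n):
    mask = np.arange(n) != i
    for m, model in enumerate(base_models):
        loo_fit = clone(model).fit(X[mask], y[mask])
        P[i, m] = loo_fit.predict(X[i : i + 1])[0]

# Step 4: OLS of y on P (no intercept, matching the formula in the question)
w_hat = np.linalg.lstsq(P, y, rcond=None)[0]

# Step 5: the stacked prediction combines the full-data fits with the weights
def stacked_predict(X_new):
    preds = np.column_stack([f.predict(X_new) for f in full_fits])
    return preds @ w_hat
```

If you want constrained weights instead of plain OLS, swapping the `lstsq` call for something like `scipy.optimize.nnls(P, y)` is one option.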