# Solved – How to obtain regularization parameter when pruning decision trees

I'm having trouble understanding exactly how to obtain the regularization parameter when pruning a decision tree with the minimal cost complexity approach. Assume the cost complexity function is represented as

\$\$C_\alpha(T) = R(T) + \alpha|T|,\$\$

where \$\alpha\$ is the regularization parameter to be chosen and \$|T|\$ is the number of terminal nodes of the tree \$T\$.

Using the entire data set, we apply weakest-link cutting to obtain a set of \$\alpha\$'s and the corresponding sub-trees that minimize the cost for each \$\alpha\$.
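As a concrete sketch of this step, scikit-learn exposes weakest-link cutting via `cost_complexity_pruning_path` (the choice of library and data set here is just for illustration; the question itself is library-agnostic):

```python
# Sketch: obtain the sequence of alphas (and subtree impurities) produced by
# weakest-link cutting on the entire data set.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(X, y)  # weakest-link cutting

# path.ccp_alphas is a non-decreasing sequence; each alpha corresponds to
# the subtree that minimizes C_alpha(T) over that interval of alpha values.
ccp_alphas = path.ccp_alphas
```

Each entry of `ccp_alphas` marks the point at which the next-weakest link is cut, so the sequence of alphas indexes the nested sequence of optimal sub-trees.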

Next, we generally use K-fold cross-validation. This is where the pruning approach becomes unclear to me. For K-fold CV we estimate K trees. I would have thought we would use the original \$\alpha\$'s, obtained from the entire sample, to identify the sequence of optimal sub-trees in each fold. We would then proceed with CV, selecting the \$\alpha\$ with the smallest average error.
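The procedure I have in mind could be sketched like this (scikit-learn usage is an assumption for illustration; the alphas come from the full sample and are then scored by cross-validation):

```python
# Sketch: score each full-sample alpha by K-fold CV and pick the best one.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Alphas from weakest-link cutting on the entire data set
full_tree = DecisionTreeClassifier(random_state=0)
ccp_alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# For each candidate alpha, fit a pruned tree within each fold and average
# the validation accuracy; select the alpha with the best mean CV score.
mean_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5
    ).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(mean_scores))]
```

Note that this sidesteps the issue raised below: here a single shared grid of \$\alpha\$'s is evaluated in every fold, rather than deriving a separate alpha sequence per fold.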

However, several sources (these lecture notes and Introduction to Statistical Learning, p. 309) seem to suggest that within each fold a new set of \$\alpha\$'s is obtained. Let's refer to the set of \$\alpha\$'s obtained within the kth fold as \$\alpha^{(k)}\$. This does not make sense to me. It is unlikely that each entry of \$\alpha\$ (i.e. the set of \$\alpha\$'s obtained from the entire data set) will coincide with an entry of \$\alpha^{(k)}\$, or even that the elements of \$\alpha^{(k)}\$ will coincide with those of \$\alpha^{(j)}\$. How can we pick the entry of \$\alpha\$ that minimizes the cost when \$\alpha^{(k)}\$ potentially shares no entries with \$\alpha\$?
