# Solved – Proper variable selection: Use only training data or full data

I'm going through the lab exercises in "Introduction to Statistical Learning" and am having difficulty understanding the proper way to do best subset selection.

On page 248, it states that:

… We will now consider how to do this using the validation set and
cross-validation approaches. In order for these approaches to yield
accurate estimates of the test error, we must use only the training
observations to perform all aspects of model-fitting—including
variable selection. Therefore, the determination of which model of a
given size is best must be made using only the training observations.
This point is subtle but important. If the full data set is used to
perform the best subset selection step, the validation set errors and
cross-validation errors that we obtain will not be accurate estimates
of the test error.

However, this is followed by this from pg 249:

Finally, we perform best subset selection on the full data set,
and select the best ten-variable model. It is important that we make
use of the full data set in order to obtain more accurate coefficient
estimates. Note that we perform best subset selection on the full
data set and select the best ten variable model, rather than simply
using the variables that were obtained from the training set, because
the best ten-variable model on the full data set may differ from the
corresponding model on the training set.

It seems that we use only the training set to determine the test errors that arise from having different numbers of variables in our models. Assuming we found a model with 10 variables to have the least error, we then use the full data to select the 10 best variables.

Why don't we use the training data throughout the feature selection process? Wouldn't the issue of using the full data set occur if we perform best subset selection as suggested?


The distinction here is between how to produce the final model for operational use and how to estimate the generalisation performance of that model.

If we are to get an unbiased performance estimate, we must use a sample of data that has not been used to tune any aspect of the model, which includes any feature selection, hyper-parameter tuning, or model selection steps. Thus when estimating the performance of the model, we need to make all of these choices using only the training data, so that the validation/test data remains "statistically pure".

However, we want the best possible model to use in operation, so once we have settled on a procedure to build the model, we rebuild it using the entire dataset so that we have the advantage of using a bit more data (which means the model parameters will be estimated a bit better).

This usually makes the performance estimate a little pessimistic, because it is really an estimate of the performance of a model trained on a sample as large as the training set, rather than one trained on the full dataset. However, it is generally better to have a pessimistic estimate of performance than an optimistic one, which is what you would get if you used the test/validation data for feature, hyper-parameter or model selection.
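As a rough illustration of the two stages described above, here is a minimal sketch in plain Python (not the book's R lab code). The synthetic data, the 80/40 split, and the helper names `fit_ols`, `mse` and `best_subset` are all assumptions made up for this example. Stage 1 picks the model size using only the training rows for selection and fitting; stage 2 reruns best subset selection of that size on the full data.

```python
import itertools
import random


def fit_ols(X, y, cols):
    """Least-squares fit of y on an intercept plus the columns in `cols`."""
    A = [[1.0] + [row[c] for c in cols] for row in X]
    m = len(cols) + 1
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix.
    M = [[sum(a[i] * a[j] for a in A) for j in range(m)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(m):
        piv = max(range(i, m), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(m):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [vr - f * vi for vr, vi in zip(M[r], M[i])]
    return [M[i][m] / M[i][i] for i in range(m)]


def mse(beta, X, y, cols):
    """Mean squared error of the fitted model on the rows of X."""
    preds = [beta[0] + sum(b * row[c] for b, c in zip(beta[1:], cols))
             for row in X]
    return sum((q - t) ** 2 for q, t in zip(preds, y)) / len(y)


def best_subset(X, y, k):
    """Best k-variable model by in-sample error, using only the rows given."""
    return min(itertools.combinations(range(len(X[0])), k),
               key=lambda cols: mse(fit_ols(X, y, cols), X, y, cols))


# Synthetic data (an assumption for illustration): only features 0 and 3 matter.
random.seed(0)
n, p = 120, 6
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * r[0] - 1.5 * r[3] + random.gauss(0, 1) for r in X]

X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

# Stage 1: choose the model size. Selection and fitting see training rows only;
# the validation rows are used solely to score each size.
val_errors = {}
for k in range(1, p + 1):
    cols = best_subset(X_train, y_train, k)
    beta = fit_ols(X_train, y_train, cols)
    val_errors[k] = mse(beta, X_val, y_val, cols)
best_k = min(val_errors, key=val_errors.get)

# Stage 2: build the operational model. Rerun best subset selection of the
# chosen size on ALL rows; the chosen variables may differ from stage 1.
final_cols = best_subset(X, y, best_k)
final_beta = fit_ols(X, y, final_cols)

print("validation MSE by size:", val_errors)
print("chosen size:", best_k, "final variables:", final_cols)
```

The number reported as the performance estimate is `val_errors[best_k]`, which comes from models that never saw the validation rows. The final model is not scored at all; it simply benefits from the extra data.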

Essentially in performance estimation, we are estimating the performance of a method for producing the final model, rather than the performance of the model itself.
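To see why the selection step has to be treated as part of the method being evaluated, the following sketch compares two ways of cross-validating a deliberately simple procedure ("pick the single feature most correlated with the response, then fit a line"). The pure-noise data, the 5-fold split, and the function names are assumptions for this example. Since no feature is truly predictive, an honest estimate should be no better than predicting the mean, and selecting the feature on the full data before cross-validating would typically be expected to give a lower, optimistic error.

```python
import random


def top_feature(X, y):
    """Index of the feature with the largest absolute correlation with y."""
    n = len(y)
    y_mean = sum(y) / n

    def abs_corr(j):
        x_mean = sum(r[j] for r in X) / n
        sxy = sum((r[j] - x_mean) * (t - y_mean) for r, t in zip(X, y))
        sxx = sum((r[j] - x_mean) ** 2 for r in X)
        # The y-variance term is the same for every j, so it is omitted.
        return abs(sxy) / sxx ** 0.5

    return max(range(len(X[0])), key=abs_corr)


def fit_line(x, y):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(y)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    slope = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
             / sum((a - x_mean) ** 2 for a in x))
    return y_mean - slope * x_mean, slope


def cv_mse(X, y, folds, select_inside):
    """K-fold CV error of the 'select one feature, fit a line' procedure."""
    n = len(y)
    leaked_choice = top_feature(X, y)  # sees every row, held-out ones included
    total = 0.0
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Honest version: redo the selection on this fold's training rows.
        j = top_feature(X_tr, y_tr) if select_inside else leaked_choice
        intercept, slope = fit_line([r[j] for r in X_tr], y_tr)
        total += sum((y[i] - (intercept + slope * X[i][j])) ** 2 for i in test)
    return total / n


# Pure noise (an assumption for illustration): y is independent of every feature.
random.seed(1)
n, p = 40, 200
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

honest = cv_mse(X, y, folds=5, select_inside=True)
leaky = cv_mse(X, y, folds=5, select_inside=False)
print(f"selection repeated inside each fold: {honest:.3f}")
print(f"selection done once on full data:    {leaky:.3f}")
```

In the honest version each fold reruns the whole procedure, selection included, so the error describes the method rather than one particular fitted model.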
