(5 points) Assume that we are interested in generating a model (e.g., a decision tree) from a
sample of examples of a specific size drawn from some distribution. Assume further that we
would like to investigate how sensitive the resulting model is to the actual choice of training
examples (i.e., how the performance varies over different sets of training examples of the
specific size). Assume that we have access to 100 training examples drawn from the underlying
distribution. If we are interested in investigating how the performance varies for models
generated from 90 examples, would we obtain a reliable estimate of the variance of the model
performance by performing a 10-fold cross-validation? Motivate your answer.
Is the 10-fold cross validation reliable for estimating the variance of model performance?
10-fold cross-validation is known to be a good way to get unbiased, or nearly unbiased, estimates of the error rate for classification/prediction based on a training set of a given size. If that is what you mean, then the answer to your first question is yes.
If by variance you mean how the performance of the decision trees (which differ because the training samples differ) varies from one training sample of size 90 to another, I am not sure. Note that the ten CV training sets are far from independent: any two of them share 80 of their 90 examples, so the resulting models are highly correlated, and the spread of the ten fold scores will tend to understate the true variance. But I do think you could assess that variance with the bootstrap.
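To make the bootstrap idea concrete, here is a minimal sketch. It is a toy setup, not the asker's actual data: a hypothetical pool of 100 labelled 1-D examples, a decision stump standing in for the decision tree, and repeated bootstrap samples of size 90 whose fitted models are scored on the out-of-bag examples. The spread of those scores estimates how performance varies across training sets of size 90.

```python
import random
import statistics

random.seed(0)

# Hypothetical pool of 100 labelled examples: 1-D feature,
# true class boundary at x = 0.5, with 10% label noise.
pool = []
for _ in range(100):
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.1:  # flip the label with probability 0.1
        y = 1 - y
    pool.append((x, y))

def fit_stump(train):
    """Stand-in for a decision tree: threshold at the midpoint
    of the two class means (fallback if a class is absent)."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if not xs0 or not xs1:
        return 0.5
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def accuracy(thr, data):
    return sum(int(x > thr) == y for x, y in data) / len(data)

scores = []
for _ in range(200):
    # Bootstrap sample of size 90 (drawn with replacement).
    train = random.choices(pool, k=90)
    # Score on the examples that did not enter this sample.
    oob = [p for p in pool if p not in train]
    scores.append(accuracy(fit_stump(train), oob))

var = statistics.variance(scores)
print(f"mean accuracy {statistics.mean(scores):.3f}, variance {var:.5f}")
```

Because the bootstrap samples are drawn independently (with replacement) from the pool, the 200 fitted models are much less correlated than the 10 models from a single cross-validation run, which is why the spread of `scores` is a more honest picture of the variance in question.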