Solved – Estimating prediction error

Would appreciate any answer on characterizing/estimating prediction error on future data for a nonlinear regression problem. Under what conditions would cross-validation error, or simple test error on a randomly selected 20% of the available data, be useful for characterizing prediction error on new data (expected value, or max/min)? I've heard somewhere that cross-validation error is an optimistic estimate; what would be a pessimistic (but somewhat tight) upper bound on prediction error?

If you have done cross-validation very carefully (there are many ways to make mistakes that lead to overly optimistic results), and your new data are drawn from the same population as the training data, then the cross-validation result should be about right. In technical terms, cross-validation returns an approximately unbiased estimate of the error, so even though the test result may vary from expectations, it is just as likely to be better as it is to be worse.
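As a concrete illustration (a minimal pure-Python sketch on made-up synthetic data; the model and helper names are illustrative, not from the question), k-fold cross-validation averages the held-out error over folds, and that average tracks the error measured on a fresh draw from the same population:

```python
import random
import statistics

random.seed(0)

def fit_linear(xs, ys):
    # closed-form ordinary least squares for y = a + b*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def mse(model, xs, ys):
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(n):
    # synthetic population: y = 2x + Gaussian noise
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 0.1) for x in xs]
    return xs, ys

def kfold_cv_mse(xs, ys, k=5):
    # shuffle indices, split into k folds, average held-out MSE
    idx = list(range(len(xs)))
    random.shuffle(idx)
    scores = []
    for f in (idx[i::k] for i in range(k)):
        held = set(f)
        tr_x = [xs[i] for i in idx if i not in held]
        tr_y = [ys[i] for i in idx if i not in held]
        model = fit_linear(tr_x, tr_y)
        scores.append(mse(model, [xs[i] for i in f], [ys[i] for i in f]))
    return statistics.mean(scores)

xs, ys = make_data(200)
cv_err = kfold_cv_mse(xs, ys)
new_x, new_y = make_data(1000)  # fresh draw from the same population
true_err = mse(fit_linear(xs, ys), new_x, new_y)
print(cv_err, true_err)  # both should be close to the noise variance
```

Because the new data come from the same population, `cv_err` and `true_err` land close together; neither is systematically better.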

For a good guide to cross-validation, see chapter 7 of Elements of Statistical Learning. A common mistake is to make choices while developing the model (tuning parameters, deciding which variables are useful, even which algorithm to use) outside the cross-validation loop; every such choice needs to be evaluated within the cross-validation itself, typically via nested cross-validation.
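The "every choice inside the loop" point can be sketched with nested cross-validation. In this hedged pure-Python example (a made-up 1-D k-nearest-neighbour regressor; helper names like `best_k` are illustrative), the tuning parameter k is chosen by an inner CV that only sees the outer fold's training data:

```python
import math
import random
import statistics

random.seed(1)

def knn_predict(train, x, k):
    # average y of the k nearest training points by |x - xi|
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(p[1] for p in nearest) / k

def mse(train, test, k):
    return statistics.mean((y - knn_predict(train, x, k)) ** 2
                           for x, y in test)

def folds(data, n_folds):
    d = data[:]
    random.shuffle(d)
    return [d[i::n_folds] for i in range(n_folds)]

def best_k(train, candidates=(1, 3, 5, 9)):
    # inner CV: choose k using only the outer-fold training data
    inner = folds(train, 3)
    def cv_err(k):
        return statistics.mean(
            mse([p for j, f in enumerate(inner) if j != i for p in f],
                inner[i], k)
            for i in range(3))
    return min(candidates, key=cv_err)

# synthetic nonlinear regression data: y = sin(2*pi*x) + noise
data = [(x, math.sin(2 * math.pi * x) + random.gauss(0, 0.1))
        for x in (random.uniform(0, 1) for _ in range(150))]

outer = folds(data, 5)
errs = []
for i in range(5):
    train = [p for j, f in enumerate(outer) if j != i for p in f]
    k = best_k(train)               # tuning happens inside the outer loop
    errs.append(mse(train, outer[i], k))
nested_err = statistics.mean(errs)
print(nested_err)
```

Picking k on the full data first and then cross-validating would let the held-out folds leak into the tuning decision, which is exactly the optimistic-bias mistake described above.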

However, the key assumption is that the test set comes from the same population as the training set. In many real-world applications of statistical models, the system being modelled is likely to change over time, even in subtle ways such as changes in how samples are taken. Any such change will degrade the performance of the model. For this reason, in practical terms the cross-validation error on a static training set is likely to be optimistic about how the model will perform in the real world. The details depend entirely on the nature of the data, so there is no single quantitative answer to your question.
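A toy illustration of that caveat (a deliberately contrived sketch, not a recipe): a linear model fit where the true relationship is quadratic looks fine on data from the training range, but its error blows up once the sampling drifts outside that range.

```python
import random

random.seed(2)

def fit_linear(pts):
    # closed-form ordinary least squares for y = a + b*x
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    return my - b * mx, b

def mse(model, pts):
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in pts) / len(pts)

def sample(lo, hi, n=300):
    # true relationship is quadratic; a linear fit is only locally adequate
    return [(x, x * x + random.gauss(0, 0.05))
            for x in (random.uniform(lo, hi) for _ in range(n))]

model = fit_linear(sample(0, 1))     # train on x in [0, 1]
in_dist = mse(model, sample(0, 1))   # same population: small error
shifted = mse(model, sample(1, 2))   # sampling drifted to x in [1, 2]
print(in_dist, shifted)              # shifted error is far larger
```

No cross-validation on the original training range could have warned about the shifted error; that is why the CV estimate can be optimistic in practice even when the procedure itself is done correctly.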
