Solved – Sklearn Learning Curve Example

According to scikit-learn's example on interpreting learning curves, the following figure suggests that the SVM model needs more training examples to improve its validation score.
[Figure: learning curve for the SVM model]

My question is, how is the need for more training examples justified when both the training score and the validation score are near perfect?

Why do they say more training cases are needed?

A training error that is better than the error on any kind of data independent of the training (test set, cross validation, out-of-bootstrap, …) is a symptom of overfitting, so the primary diagnosis here is overfitting.

One possibility to reduce overfitting is to get more training data.

Another possibility here would be to fix the sample size at, say, 1400 cases and plot training and generalization error over γ, to see at which γ the training error starts to move away from 0 (a more restrictive model).
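A minimal sketch of such a sweep over γ, using scikit-learn's `validation_curve`. The digits dataset, the RBF kernel, the γ grid, and 5-fold cross validation are my assumptions for illustration; the post does not state which of these the original example uses.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Assumption: the digits dataset stands in for the data behind the figure.
X, y = load_digits(return_X_y=True)

# Fix the sample size at 1400 cases (a shuffled subset of the data).
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))[:1400]
X_sub, y_sub = X[idx], y[idx]

# Assumed grid: five values of gamma on a log scale.
gammas = np.logspace(-6, -2, 5)

# With cv=5, each model is trained on 4/5 of the 1400 cases.
train_scores, cv_scores = validation_curve(
    SVC(kernel="rbf"), X_sub, y_sub,
    param_name="gamma", param_range=gammas, cv=5,
)

# Convert accuracy to error and average over the folds.
train_err = 1.0 - train_scores.mean(axis=1)
cv_err = 1.0 - cv_scores.mean(axis=1)

for g, tr, cv in zip(gammas, train_err, cv_err):
    print(f"gamma={g:.0e}  training error={tr:.4f}  CV error={cv:.4f}")
```

Reading the printed table from large γ to small γ shows where the training error leaves 0 and how far the CV error sits above it at each setting.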

With near-perfect cross validation results it is of course difficult to see whether the training error starts to approach the CV error. Looking at the numbers may help here. I tend to consider both the absolute difference between training and test performance and the ratio $\frac{\text{test error}}{\text{training error}}$: there may still be an order of magnitude between them here.
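To illustrate why the ratio is worth looking at, here is a small sketch that reports both quantities. The error values are hypothetical numbers chosen for illustration, not values read from the figure, and `compare_errors` is a helper name I made up.

```python
import math

def compare_errors(train_err, test_err):
    """Return (absolute difference, test/training error ratio)."""
    diff = test_err - train_err
    # A training error of exactly 0 makes the ratio unbounded.
    ratio = test_err / train_err if train_err > 0 else math.inf
    return diff, ratio

# Hypothetical numbers: 0.1% training error, 2% test error.
diff, ratio = compare_errors(train_err=0.001, test_err=0.02)
print(f"absolute difference: {diff:.3f}")
print(f"test/training ratio: {ratio:.1f}")
```

With these made-up values the absolute gap looks small on a plot, yet the test error is about twenty times the training error, which is more than an order of magnitude.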

Do we really need more cases?

In practice, I'd recommend checking the estimated performance against the application's needs: if you need better than 98% accuracy, you're fine here. If you need to demonstrate $\geq$ 99.5% accuracy, you need more cases (probably many more, if you also consider the test sample size requirements for a one-sided confidence interval).
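As a rough sketch of that test sample size requirement, consider the best case in which the model makes no errors at all on n independent test cases. The one-sided exact binomial bound then demonstrates accuracy p at confidence level 1 − α only if p^n ≤ α. The 95% confidence level and the zero-error assumption are mine, and `min_test_cases` is a hypothetical helper; any observed error pushes the required n higher.

```python
import math

def min_test_cases(target_accuracy, confidence=0.95):
    """Smallest n such that n error-free test cases demonstrate
    accuracy >= target_accuracy at the given one-sided confidence."""
    alpha = 1.0 - confidence
    # Solve target_accuracy ** n <= alpha for n.
    return math.ceil(math.log(alpha) / math.log(target_accuracy))

for target in (0.98, 0.995):
    n = min_test_cases(target)
    print(f"accuracy >= {target:.1%}: at least {n} error-free test cases")
```

Under these assumptions, demonstrating 99.5% accuracy takes about four times as many test cases as demonstrating 98%, and that is before a single misclassification is allowed for.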
