Solved – Underfitting? Validation scores above training scores

I'm currently plotting a learning curve using the following code:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    estimator = LogisticRegression()
    train_sizes, train_scores, valid_scores = learning_curve(
        estimator, X=X_train, y=labels, cv=10)

    # Average the scores over the 10 CV folds for each training-set size
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    valid_mean = np.mean(valid_scores, axis=1)
    valid_std = np.std(valid_scores, axis=1)

    plt.plot(train_sizes, train_mean, color='blue', marker='o',
             markersize=5, label='training accuracy')
    plt.plot(train_sizes, valid_mean, color='red', marker='o',
             markersize=5, label='valid accuracy')

[Figure: logistic regression learning curve]

Above is what is plotted. I'm assuming this is due to some sort of underfitting, or we need to better diagnose what's happening with the model?

For reference, below is the learning curve if I use a DecisionTreeClassifier instead (the rest of the code being the same). This is much closer to what I'd expect, so I'm leaning towards logistic regression underfitting.

[Figure: decision tree learning curve]

If anyone can help explain what the learning curve for logistic regression is telling me, and how to interpret the case where the validation accuracy is lower than the training accuracy, that'd be much appreciated.

> I'm assuming this is due to some sort of underfitting, or we need to better diagnose what's happening with the model?

I would agree that it is under-fitting for the Logistic regression (first case) and over-fitting for the DecisionTreeClassifier (second case).

With regard to your original question:

> Validation scores above training scores

This can happen because of the randomness of the data within your training/validation folds. For example, say you have a ZeroR classification model that only ever predicts one class, e.g. $(1, 1, 1, \ldots)$. If your training dataset contains 60% $1$s and your validation dataset contains 70% $1$s, your training accuracy is 60% but your validation accuracy is 70%. As you train/validate over different folds, the variations in the sampling of labels can cause the train/validation accuracies to "invert".
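The ZeroR arithmetic above can be reproduced with a minimal sketch in plain Python. The fold contents and the helper names `zero_r_fit` and `accuracy` are made up for illustration; they are not taken from your data or from scikit-learn.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the most frequent label; ZeroR always predicts this value."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(y_true, predicted_label):
    """Accuracy of a constant prediction against the true labels."""
    return sum(1 for y in y_true if y == predicted_label) / len(y_true)


# Hypothetical folds: 60% ones in training, 70% ones in validation.
train_labels = [1] * 60 + [0] * 40
valid_labels = [1] * 70 + [0] * 30

majority = zero_r_fit(train_labels)            # learned from training data only
train_acc = accuracy(train_labels, majority)   # 60 / 100
valid_acc = accuracy(valid_labels, majority)   # 70 / 100

print(f"train accuracy = {train_acc:.2f}, valid accuracy = {valid_acc:.2f}")
```

Nothing about the model changed between the two numbers; only the label mix of the fold did, which is enough to put the validation score above the training score.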

I also think the effect you observed, with validation scores above training scores, is accentuated because the scales on the two graphs are different (the range in the first graph is very small compared to the second). If you re-plot the first graph with the same range as the second, you might not notice that behaviour as easily. You may find your Logit model is actually just taking the dominant class (cf. ZeroR) as a proxy for its prediction; hence your under-fitting ("high-bias") problem.
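One way to check whether the model is mostly predicting the dominant class is to compare its accuracy with the majority-class baseline and see what share of its predictions fall on that class. The sketch below is only illustrative: `majority_class_report` is a hypothetical helper, and the labels and predictions are invented rather than taken from your model.

```python
from collections import Counter


def majority_class_report(y_true, y_pred):
    """Compare a model's predictions with the majority-class (ZeroR) baseline."""
    majority_label, majority_count = Counter(y_true).most_common(1)[0]
    n = len(y_true)
    return {
        "majority_label": majority_label,
        # Accuracy you would get by always predicting the majority label
        "baseline_accuracy": majority_count / n,
        # Accuracy of the actual predictions
        "model_accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        # Fraction of predictions that are simply the majority label
        "majority_prediction_share": sum(p == majority_label for p in y_pred) / n,
    }


# Hypothetical validation labels and model predictions.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

report = majority_class_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value}")
```

With your own data you would pass the true validation labels and the output of `estimator.predict(...)` on the validation features. If the model accuracy is close to the baseline and nearly all predictions are the majority label, that supports the high-bias explanation.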
