Solved – Why does the model consistently perform worse in cross-validation

Okay so I run this model manually and get around 80-90% accuracy:

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report

mlp = MLPClassifier(hidden_layer_sizes=(50, 50), activation="logistic", max_iter=500)
mlp.out_activation_ = "logistic"
mlp.fit(X_train, Y_train)
predictions = mlp.predict(X_test)
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))

Then, I do some 10-fold cross validation:

print(cross_val_score(mlp, X_test, Y_test, scoring='accuracy', cv=10))

And I get accuracy stats something like the following for each fold:

[0.72527473 0.72222222 0.73333333 0.65555556 0.68888889 0.70786517
0.69662921 0.75280899 0.68539326 0.74157303]

I've done this about 5 times now. Every time I run the model on its own, I get 80-90% accuracy, but when I run cross-validation, the average across the folds is 10-20 points lower than the single manual run.

The chance of getting the best model on the first try, five times in a row, is 1 in 161,051 ((1/11)^5). So I must just be doing something wrong somewhere.

Why does my model consistently perform worse in cross-validation?

EDIT – I'd like to add that I'm doing exactly the same thing with a RandomForestClassifier() and getting expected results, i.e. the accuracy obtained when I run the model manually is around the same as when run by the cross_val_score() function. So what is it about my MLPClassifier() that's producing this mismatch in accuracy?

I think there is some confusion as to what is actually being compared here. First, a model is trained on the X_train/Y_train dataset. Testing that model against the X_test/Y_test (holdout) dataset gives the 80-90% accuracy. Next, cross_val_score() is called. Crucially, it does not score the already-fitted model: it clones the estimator and re-fits it from scratch on each fold of whatever data it is given, which in the code shown is X_test/Y_test.

So the answer to why the single holdout score differs from the 10-fold scores is that the metrics are being obtained from different models trained on different data. The 80-90% score comes from one model trained on the full training set and evaluated once on the holdout, while the 65-75% fold scores come from ten fresh models, each trained on only about 90% of the much smaller test set and evaluated on the remaining 10%.
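To make that concrete, here is a minimal, self-contained sketch (using synthetic data from make_classification, not the poster's dataset) of the two measurements side by side, plus the like-for-like version where cross-validation is run on the training data instead:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the poster's data
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# random_state pins the weight initialisation so repeated runs are comparable
mlp = MLPClassifier(hidden_layer_sizes=(50, 50), activation="logistic",
                    max_iter=500, random_state=0)

# Measurement 1: one model, fit on all of X_train, scored once on the holdout
mlp.fit(X_train, y_train)
holdout_score = mlp.score(X_test, y_test)

# Measurement 2 (what the question's code does): cross_val_score clones mlp
# and re-fits it on each fold of X_test -- ten small models, none of which
# is the model fitted above
test_cv_scores = cross_val_score(mlp, X_test, y_test, cv=10)

# Like-for-like alternative: cross-validate on the training data, so each
# fold model is trained on data comparable to the manually trained one
train_cv_scores = cross_val_score(mlp, X_train, y_train, cv=10)

print(holdout_score, test_cv_scores.mean(), train_cv_scores.mean())
```

The exact gaps between the three numbers will depend on the data, but the structural point stands: cross_val_score never evaluates the model fitted by hand, so its folds are not measuring the same thing as the single holdout score.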
