Solved – Are Kaggle competitions just won by chance?

Kaggle competitions determine final rankings based on a held-out test set.

A held-out test set is a sample; it may not be representative of the population being modeled. Since each submission is like a hypothesis, the algorithm that won the competition may just, by total chance, have ended up matching the test set better than the others. In other words, if a different test set were selected and the competition repeated, would the rankings remain the same?

For the sponsoring corporation, this doesn't really matter (probably any of the top 20 submissions would improve on their baseline), although, ironically, they might end up using a first-ranked model that is worse than the other top five. But for the competition participants, it seems that Kaggle is ultimately a game of chance: luck isn't needed to stumble on the right solution, it's needed to stumble on the one that matches the test set!

Is it possible to change the competition so that all the top teams who can't be statistically distinguished win? Or, in this group, could the most parsimonious or computationally cheap model win?

Yes, your reasoning is correct. If a different test set were selected and the competition repeated, the rankings would indeed change. Consider the following example. Suppose all entries to a Kaggle competition with binary labels just guess randomly (and, say, independently) to predict their output. By chance, one of them will agree with the holdout more than the others, even though no prediction is going on.
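The random-guessing example can be simulated directly. The following is a minimal sketch, not anything Kaggle itself runs. The number of entries, the holdout size, and the function name `random_guess_accuracies` are all illustrative assumptions.

```python
import numpy as np

def random_guess_accuracies(n_entries, n_test, rng):
    """Accuracy of each entry when every prediction is a fair coin flip.

    Hypothetical helper for illustration: labels and guesses are both
    drawn uniformly from {0, 1}, so no entry has any real skill.
    """
    y_true = rng.integers(0, 2, size=n_test)
    preds = rng.integers(0, 2, size=(n_entries, n_test))
    return (preds == y_true).mean(axis=1)

rng = np.random.default_rng(0)
n_entries, n_test = 1000, 2000  # assumed sizes, chosen only for the demo

# Score the same skill-free entries on two independent holdout sets.
acc_a = random_guess_accuracies(n_entries, n_test, rng)
acc_b = random_guess_accuracies(n_entries, n_test, rng)

winner = int(np.argmax(acc_a))
print(f"best accuracy on holdout A: {acc_a[winner]:.3f}")
print(f"same entry on holdout B:    {acc_b[winner]:.3f}")
print(f"rank of that entry on B:    {int((acc_b > acc_b[winner]).sum()) + 1}")
```

Since every entry is a coin flip, the "winner" on holdout A has no reason to stay on top on holdout B. Its lead is produced entirely by taking the maximum over many noisy scores.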

While this is a bit contrived, it shows that the variance in each submission's model means that, with many such entries, picking the top score amounts partly to fitting the noise of the holdout set. This tells us that (depending on the individual model variances) the top-N models probably generalize about equally well. This is the garden of forking paths, except that the "researchers" aren't the same person (but that doesn't matter).
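To put a rough number on that noise, assume the metric is accuracy and the n test examples are independent draws. Both are assumptions for illustration, not something a given competition guarantees. The standard error of a holdout accuracy is then the usual binomial one, shown here with made-up values:

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\qquad
\hat{p} = 0.90,\; n = 10\,000
\;\Rightarrow\;
\mathrm{SE} = \sqrt{\frac{0.90 \times 0.10}{10\,000}} = 0.003
```

With these illustrative numbers, entries whose scores differ by a few tenths of a percentage point sit within about one standard error of each other. Scores computed on the same test set are correlated, so this is only a rough guide to how far apart two entries must be.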

Is it possible to change the competition so that all the teams who can't be statistically distinguished from the top performance on the test set win?


  • One approach (impractical as it is) would be to explicitly work out the variance of the model in each entry, which would give us a confidence interval (CI) on its holdout performance.
  • Another approach, which might take a lot of computation, is to bootstrap a CI on holdout performance by exposing a training and testing API for all of the models.
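A cheaper variant of the bootstrap idea is sketched below. It resamples only the holdout rows and compares two entries' fixed predictions, so it ignores training variance and is a simplification of the full train-and-test bootstrap described above. The function name `paired_bootstrap_diff_ci`, the accuracy metric, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def paired_bootstrap_diff_ci(y_true, pred_a, pred_b,
                             n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B).

    Hypothetical helper: resamples holdout rows with replacement and
    keeps the two entries paired on the same resampled rows.
    """
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    # Per-row difference in correctness: +1, 0, or -1.
    per_row_diff = ((pred_a == y_true).astype(float)
                    - (pred_b == y_true).astype(float))
    n = len(per_row_diff)
    idx = rng.integers(0, n, size=(n_boot, n))    # bootstrap row indices
    boot_diffs = per_row_diff[idx].mean(axis=1)   # one difference per resample
    lo, hi = np.quantile(boot_diffs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic demo: two entries with nearly identical true accuracy (assumed).
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=2000)
pred_a = np.where(rng.random(2000) < 0.81, y_true, 1 - y_true)
pred_b = np.where(rng.random(2000) < 0.80, y_true, 1 - y_true)

ci_low, ci_high = paired_bootstrap_diff_ci(y_true, pred_a, pred_b)
print(f"95% CI for accuracy difference: [{ci_low:.4f}, {ci_high:.4f}]")
```

If the interval contains 0, the two entries cannot be separated on this holdout at the chosen level. One possible rule would then be to treat every entry whose interval against the leader contains 0 as tied.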
