Solved – Is it ‘fair’ to set a seed in a random forest regression to yield the highest accuracy

I have a random forest regression built with scikit-learn, and I notice that I get different results depending on the value I set for the random seed.

If I use LOOCV to establish which seed works best, is this a valid method?

The answer is no.

Your model gives a different result for each seed because training is randomized: each tree is fit on a bootstrap sample of the data (and, depending on the settings, considers a random subset of features at each split). Choosing the seed that maximizes performance on the validation set means choosing the "arrangement" that happens to fit that set best. This does not guarantee that the model with this seed will perform better on a separate test set; it simply means that you have overfit the model to the validation set.
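The effect can be checked with a small experiment. The following is a minimal sketch, not the asker's setup: it assumes synthetic data from make_regression, a single held-out validation split instead of LOOCV (to keep it fast), and arbitrary sizes and seed values. It picks the seed with the best validation score and then compares that seed's test score with the average test score over all seeds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data, assumed purely for illustration.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

# Three disjoint parts: train, validation (used to pick the seed), test (untouched).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

seeds = list(range(10))
val_scores, test_scores = [], []
for seed in seeds:
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    model.fit(X_train, y_train)
    val_scores.append(model.score(X_val, y_val))    # R^2 on the validation set
    test_scores.append(model.score(X_test, y_test))  # R^2 on the test set

val_scores = np.array(val_scores)
test_scores = np.array(test_scores)

# "Seed tuning": keep the seed with the highest validation score.
best = int(np.argmax(val_scores))

print(f"selected seed: {seeds[best]}")
print(f"validation R^2 of selected seed: {val_scores[best]:.3f} (mean over seeds: {val_scores.mean():.3f})")
print(f"test R^2 of selected seed:       {test_scores[best]:.3f} (mean over seeds: {test_scores.mean():.3f})")
```

By construction the selected seed looks at least as good as the average on the validation set. Whether it is also above average on the test set is the interesting part; the validation advantage is an optimistic estimate and there is no reason to expect it to carry over in full.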

This effect is why many people who rank high on the public test set in competitions (e.g. Kaggle) fall far down the ranking on the hidden test set. Picking the seed this way is not considered a legitimate way to improve a model.

Edit (not directly related to the answer, but I found it interesting)

You can find an interesting study showing the influence of random seeds in computer vision here. The authors first show that some seeds give better results than others, and then offer the critique that many supposed SOTA solutions could merely reflect better seed selection than their competitors. This is described in the same context as if it were cheating, which in all fairness it kind of is. Better seed selection does not make your model inherently better; it just makes it appear better on the specific test set.
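A fairer way to use seeds is to report the spread of scores across several of them rather than the single best one. The sketch below is one possible way to do that, again under assumptions of my own: synthetic data, 5-fold cross-validation, and an arbitrary list of seeds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, assumed purely for illustration.
X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

seeds = [0, 1, 2, 3, 4]
scores = []
for seed in seeds:
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    # Mean R^2 over 5 folds for this seed.
    scores.append(cross_val_score(model, X, y, cv=5).mean())

scores = np.array(scores)

# Report the distribution over seeds instead of the maximum.
print(f"R^2 over {len(seeds)} seeds: mean={scores.mean():.3f}, std={scores.std():.3f}")
print(f"min={scores.min():.3f}, max={scores.max():.3f}")
```

If the spread across seeds is large relative to the differences you care about, increasing n_estimators usually reduces it, since averaging over more trees smooths out the randomness of any single one.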
