Should I use GridSearchCV on all of the data, or just the training set?

I have a dataset and my intention would be to predict a binary variable using DecisionTreeClassifier.

    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    dtc = DecisionTreeClassifier()
    parameter_grid = {'criterion': ['gini', 'entropy'],
                      'splitter': ['best', 'random'],
                      'max_depth': range(1, 25),
                      'min_samples_split': range(2, 30),
                      'max_features': range(1, 10)}

    skf = StratifiedKFold(n_splits=15)
    grid_search = GridSearchCV(dtc, param_grid=parameter_grid, scoring='recall', cv=skf)
    grid_search.fit(X, Y)  # X and Y hold the entire dataset here
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    #grid_search.grid_scores_

Using GridSearchCV I can find the best set of parameters for my model. Is the score in the output the mean score on the test set? I don't understand how GridSearchCV finds the best parameters using KFold or StratifiedKFold.
In this case X and Y represent my entire dataset, with X the predictors and Y the target (0 or 1).

So, when I run grid_search.fit(X, Y), is it correct to pass the entire dataset and then use grid_search.best_estimator_ on X again? Or should I first split the data into training and test sets, pass only the training set to grid_search, and then run grid_search.best_estimator_ on the test set?

I think it's important to step back and consider the purpose of breaking your data into a training and test set in the first place.

Ultimately, your goal is to build a model that will perform the best on a new set of data, given that it is trained on the data you have. One way to evaluate how well your model will perform on a new set of data is to break off some of your data into a "test" set, and only build your model on the remaining "training" set. Then, you can apply the model to your test set, and see how well it does in its prediction, with the belief that it will perform similarly to how it would perform on a new set of data.

Technically speaking, there's nothing wrong with doing grid search to tune hyperparameters on all of your data; you're free to build a model however you want. But by using grid search on all of your data, you are defeating the purpose of doing a training/test split. That's because if you do the training/test split after doing grid search on all of your data to tune hyperparameters, applying your model to the test set no longer gives an estimate of how well your model will perform on new data, since your model has seen the test set, in the sense that the data in the test set was used to tune the hyperparameters.
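To make concrete what "used to tune the hyperparameters" means, here is a minimal sketch of how GridSearchCV scores candidates. The synthetic dataset from make_classification and the small grid are assumptions chosen for illustration, not the asker's data. For each parameter combination, the model is trained on all folds but one and scored on the held-out fold; best_score_ is the mean of those held-out scores for the winning combination. Every row passed to fit lands in a held-out fold once, so every row influences which parameters win.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y (an assumption for illustration only).
X, Y = make_classification(n_samples=300, n_features=10, random_state=0)

n_splits = 5
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [2, 4, 8]},
    scoring='recall',
    cv=StratifiedKFold(n_splits=n_splits),
)
grid_search.fit(X, Y)

results = grid_search.cv_results_
best = grid_search.best_index_

# Recall of the winning candidate on each held-out fold.
fold_scores = np.array(
    [results[f'split{i}_test_score'][best] for i in range(n_splits)]
)

# best_score_ is the mean of those per-fold scores.
mean_fold_score = fold_scores.mean()
print('Per-fold recall:', fold_scores)
print('Mean of folds:  ', mean_fold_score)
print('best_score_:    ', grid_search.best_score_)
```

So the reported score is a cross-validated mean over held-out folds of whatever data was passed to fit, not a score on a separate test set.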

As a result, if you do grid search on all of your data, the error measured on your test set will be optimistically biased (too low). When you then apply your model to new data, the error will likely be higher, although random variation means this won't hold in every single case.

In summary, if you want to use the model's performance on the test set as an estimate of how it will perform on genuinely new data, do the train/test split first and run grid search only on the training data.
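The workflow described above could look roughly like the following sketch. The synthetic data, the 25% test size, and the reduced parameter grid are all assumptions made to keep the example small and fast; substitute your own X, Y, and grid.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the full dataset (an assumption for illustration only).
X, Y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. Split first, so the test set plays no part in tuning.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=0
)

# 2. Tune hyperparameters with cross-validation on the training set only.
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'criterion': ['gini', 'entropy'],
                'max_depth': [2, 4, 8],
                'min_samples_split': [2, 10]},
    scoring='recall',
    cv=StratifiedKFold(n_splits=5),
)
grid_search.fit(X_train, Y_train)

# 3. With the default refit=True, best_estimator_ has been retrained on all of
#    X_train using the best parameters. Score it once on the untouched test set.
test_recall = recall_score(Y_test, grid_search.best_estimator_.predict(X_test))

print('Best parameters:', grid_search.best_params_)
print('Cross-validated recall (training set):', grid_search.best_score_)
print('Recall on held-out test set:', test_recall)
```

The cross-validated score is what selects the hyperparameters. The test-set recall is the number to report as the estimate of performance on new data.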
