My question: should I do CV even for a relatively large data set?
I have a relatively large data set and I will apply a machine learning algorithm to it.
Since my PC is not fast, CV (with grid search) sometimes takes too long. In particular, an SVM practically never finishes because of the many tuning parameters. So if I do CV, I need to pick a relatively small subset of the data.
On the other hand, the validation set should also be large, so I think it is a good idea to use a validation set of the same size as the training set, or larger. (That is, instead of CV I would use a single large validation set for parameter tuning.)
So I now have at least two options:
- do CV on a small subset of the data;
- use a relatively large training set and validation set, without CV (sketched below);
- something else.
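
Concretely, option 2 would look roughly like this, assuming scikit-learn (the toy data set and C grid are invented just for illustration):

```python
# Rough sketch of option 2: tune C on one large held-out validation set
# instead of cross-validating.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
# validation set the same size as the training set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

best_C, best_score = None, -1.0
for C in [0.1, 1, 10]:
    score = SVC(C=C).fit(X_tr, y_tr).score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score
print(best_C, best_score)
```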
Which is the best approach? Theoretical and practical opinions are both welcome.
Best Answer
In general, you don't have to use cross-validation all the time. The point of CV is to get a more stable estimate of your classifier's generalizability than you would get from a single test set. You don't need CV if your data set is enormous: adding data to your training set won't improve your model much, and a few misclassifications in your test set that occur purely by chance won't really change your performance metric.
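
A minimal sketch of that stability argument, assuming scikit-learn (toy data; the number of splits is arbitrary): the accuracy from a single split varies with the split, while the CV average smooths that variation out.

```python
# Compare the spread of single-split scores to a cross-validated average.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

single_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    single_scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))

print("single-split scores:", np.round(single_scores, 3))  # varies per split
print("5-fold CV mean:", cross_val_score(clf, X, y, cv=5).mean())
```

With a large enough data set, the single-split scores barely vary, which is exactly when skipping CV is defensible.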
With a small training set and a big test set, your estimate will be biased: it will probably be worse than what you would get with more training data, and the optimal hyperparameters you find may differ from those for the bigger data set, simply because more data requires less regularization.
However, finding the exact optimal hyperparameters is not the important part anyway, and it usually won't improve performance dramatically. You should focus your energy on understanding the problem, creating good features, and getting the data into good shape.
Here are a few things you can consider to speed things up:
- Train with fewer features. Use feature selection and/or dimensionality reduction to decrease the size of your problem.
- Use a precomputed kernel for the SVM, so the kernel matrix is built once and reused across the grid (see the sketch after this list).
- Use algorithms that don't need to select hyperparameters over a grid, especially linear ones such as logistic regression with a ridge/lasso/elastic-net penalty, or even a linear SVM. Depending on the implementation, these classifiers can fit models for an entire regularization path for roughly the cost of fitting a single model.
- Use a faster implementation for your type of problem (you will have to google it).
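
For the precomputed-kernel item, a minimal sketch assuming scikit-learn (the data, gamma, and C grid are made up): the Gram matrix is built once and reused for every value of C, so only the SVM solve itself is repeated.

```python
# Precompute the RBF Gram matrix once and reuse it while searching over C,
# instead of recomputing the kernel for every candidate model.
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# gamma is fixed here: a precomputed matrix only saves work for parameters
# that don't change the kernel itself (C here, but not gamma)
gamma = 0.1
K_tr = rbf_kernel(X_tr, X_tr, gamma=gamma)    # built once
K_val = rbf_kernel(X_val, X_tr, gamma=gamma)  # validation rows vs. training columns

for C in [0.1, 1, 10, 100]:
    svm = SVC(kernel="precomputed", C=C)
    svm.fit(K_tr, y_tr)
    print(C, svm.score(K_val, y_val))
```

For the path-fitting item, scikit-learn's LogisticRegressionCV and lasso_path are examples that sweep a whole sequence of regularization strengths in one call.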
And even with a slower computer, you can:
- Use more cores (the sketch below parallelizes the grid search).
- Use a GPU.
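
For the more-cores point, a one-line change is often enough, assuming scikit-learn (toy data and grid values invented for illustration):

```python
# n_jobs=-1 spreads the grid-search fits across all available cores;
# a linear SVM keeps each individual fit cheap.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
grid = GridSearchCV(LinearSVC(dual=False), {"C": [0.01, 0.1, 1, 10]},
                    cv=3, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```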