Solved – Splitting the dataset into Testing, Cross Validation and Training Set

I am trying to classify the presence of a car in an image. For this purpose I have downloaded a dataset containing images of cars. I need to know how to split this dataset into training, cross-validation and testing sets. How do I select which images fall into which set (i.e. testing set, cross-validation set or training set)? What percentages should I use for the split to get the best results?

There is no single correct percentage for a training/test split. Common ratios are 80/20 and 70/30. Basically, you want the larger proportion in the training set so that the model can be fitted properly, and a smaller proportion held out to test on.

An important note is that the split should be random. Take 70% of your data at random from the whole dataset, so as to avoid bias in the sample. You can also sample the two categories separately (70% of the negatives, 70% of the positives) to keep the same ratio of positives to negatives in both sets.
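As a rough illustration of sampling each category separately, here is a minimal sketch in plain Python (not Weka). The function name `stratified_split`, the 70% fraction, the seed, and the toy labels are all assumptions made up for this example; in practice the labels would come from your own image list.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group sample indices by their class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within the class
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Hypothetical toy data: 60 images with a car (1) and 40 without (0).
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 70 30
```

Because each class is shuffled and cut on its own, the training and test sets both keep the original 60/40 ratio of positives to negatives.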

I don't know the Weka toolbox, so I can't give you the code for it. Any statistical software should allow for a random sample.

Side Note: with your sample size you could consider cross-validation or bootstrapping rather than training/test sampling.
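To make the cross-validation suggestion concrete, here is a minimal sketch of k-fold index generation in plain Python. The helper name `k_fold_indices`, the choice of 5 folds, and the sample count of 100 are assumptions for illustration only; the idea is that every sample is used for validation exactly once and for training in the remaining folds.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)  # randomize before cutting into folds

    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Hypothetical dataset of 100 images, split into 5 folds.
splits = list(k_fold_indices(100, k=5))
for training, validation in splits:
    print(len(training), len(validation))
```

You would train and evaluate the model once per fold and average the scores, which gives a more stable estimate than a single training/test split on a small dataset.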
