I am trying to classify the presence of a car in an image. For this purpose I have downloaded a dataset containing images of cars. I need to know how to split this dataset into training, cross-validation, and test sets. How do I select which images fall into which set (i.e. test set, cross-validation set, or training set)? What percentages should I use for the split to get the best results?
Best Answer
There is no single correct percentage for a training/test split. Common ratios are 80/20 and 70/30. Basically, you want the larger proportion in the training set so the model can be fitted properly, and a smaller proportion held back to test on.
An important note is that the split should be random. Take 70% of your data at random from the whole dataset, to avoid bias in the sample. You can also sample the two categories separately (70% of the negatives, 70% of the positives) to keep the same positive/negative ratio in both sets.
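To make the per-category sampling concrete, here is a minimal sketch in plain Python (not Weka). The function name `stratified_split`, the file names, and the 100/50 class counts are all made up for illustration; only the 70% figure comes from the text above.

```python
import random

def stratified_split(items, labels, train_frac=0.7, seed=0):
    """Randomly split items into train/test, sampling each class separately."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the items by class label.
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)

    train, test = [], []
    for label, group in by_class.items():
        group = group[:]          # copy so the caller's data is not reordered
        rng.shuffle(group)        # random order within the class
        cut = round(train_frac * len(group))
        train += [(x, label) for x in group[:cut]]
        test += [(x, label) for x in group[cut:]]
    return train, test

# Hypothetical dataset: 100 images with a car (label 1), 50 without (label 0).
images = [f"car_{i}.jpg" for i in range(100)] + [f"background_{i}.jpg" for i in range(50)]
labels = [1] * 100 + [0] * 50

train, test = stratified_split(images, labels)
print(len(train), len(test))
```

Because each class is shuffled and cut on its own, both sets keep the original 2:1 ratio of positives to negatives.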
I don't know the Weka toolbox, so I can't give you Weka-specific code, but any statistical software should let you draw a random sample.
Side Note: with your sample size you could consider cross-validation or bootstrapping rather than training/test sampling.
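For reference, k-fold cross-validation can be sketched as follows, again in plain Python rather than Weka. The name `k_fold_indices`, the choice of k = 5, and the 10-sample example are assumptions for illustration; bootstrapping is not shown.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)                         # random assignment to folds
    folds = [indices[i::k] for i in range(k)]    # k disjoint, near-equal folds

    for i in range(k):
        validation = folds[i]
        # Training set is every fold except the one held out.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# Example with 10 samples: each sample is used for validation exactly once.
splits = list(k_fold_indices(10, k=5))
for training, validation in splits:
    print(sorted(validation), len(training))
```

The model would be trained and evaluated once per fold, and the k scores averaged, so every image contributes to both training and evaluation.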