I am building a model to predict a binary (Yes/No) outcome. My training sample has 1500 examples of the "Yes" group and 500 examples of the "No" group. Should I use all the data I have to train the model? Would the result be biased towards "Yes"?
I thought about using 500 "Yes" and 500 "No" examples instead, but I am not sure whether this would affect my future predictions positively or negatively.
Thanks.
Best Answer
Most learning algorithms have a way to deal with skewed data sets, for example by weighting classes or individual examples. In general, use as much data as you can for training, since more data tends to improve generalization performance.
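As a rough illustration of one such mechanism, here is a minimal sketch of class weighting applied to the 1500/500 split from the question. The helper name `balanced_class_weights` is made up for this example. The formula, `n_samples / (n_classes * class_count)`, is the heuristic behind scikit-learn's `class_weight="balanced"` option; whether it is the right choice for your model is an assumption you would need to check on held-out data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency.

    Hypothetical helper for illustration only. It uses the heuristic
    n_samples / (n_classes * class_count).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# The class sizes from the question: 1500 "Yes" and 500 "No" examples.
labels = ["Yes"] * 1500 + ["No"] * 500
weights = balanced_class_weights(labels)

# Each class now carries the same total weight, so no example is thrown away.
total_weight = {cls: weights[cls] * labels.count(cls) for cls in weights}
print(weights)
print(total_weight)
```

With these weights every "No" example counts three times as much as a "Yes" example, so both classes contribute equally to the training loss while all 2000 examples are still used. Many libraries accept such weights directly, typically through a class-weight or sample-weight argument.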