Let's say I want to model a problem with 5 input variables and 1 output variable. I have measured the correlation of each input with the output. 3 of the 5 inputs have correlation less than 0.1, and the remaining 2 inputs have correlation greater than 0.7. Should I use all input variables in my model, or only the two with high correlation?
Play with it a bit. Run some scatterplots to get a feel for your data.
It's quite possible that, since those two are so highly correlated with the output variable, they're also correlated with each other, possibly even more strongly. Check that rather than assume it. If they are, putting both of them in the model will introduce multicollinearity, which inflates the standard errors of their coefficients: both can end up being declared insignificant, and the coefficient estimates become unstable.
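To see what that check might look like, here is a minimal sketch in plain Python. The data is synthetic and made up purely for illustration (two predictors built from a shared component), and the names `pearson`, `x1`, `x2` and `y` are my own, not anything from the question. It computes the correlation between the two predictors and the resulting variance inflation factor, which for a model with exactly two predictors reduces to 1 / (1 - r^2).

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic, illustrative data (an assumption, not the asker's data):
# x1 and x2 share a common component, and y depends on that component.
rng = random.Random(0)
common = [rng.gauss(0, 1) for _ in range(500)]
x1 = [c + 0.3 * rng.gauss(0, 1) for c in common]
x2 = [c + 0.3 * rng.gauss(0, 1) for c in common]
y = [c + 0.5 * rng.gauss(0, 1) for c in common]

r_x1_y = pearson(x1, y)
r_x2_y = pearson(x2, y)
r_x1_x2 = pearson(x1, x2)

# With exactly two predictors, the variance inflation factor of either one
# is 1 / (1 - r^2), where r is the correlation between the two predictors.
vif = 1.0 / (1.0 - r_x1_x2 ** 2)

print(f"corr(x1, y)  = {r_x1_y:.3f}")
print(f"corr(x2, y)  = {r_x2_y:.3f}")
print(f"corr(x1, x2) = {r_x1_x2:.3f}")
print(f"VIF          = {vif:.2f}")
```

A VIF close to 1 means the two predictors carry mostly separate information. The larger it gets, the more the standard errors of both coefficients are inflated when the two are fitted together.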
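Since you're working with so few variables, try running all-subsets regression (5 inputs give only 31 non-empty combinations), and then pick the best-performing combination by a criterion that penalizes model size, such as adjusted R² or cross-validated error.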
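As a rough sketch of all-subsets regression, the snippet below fits ordinary least squares on every non-empty subset of five inputs and ranks the subsets by adjusted R². Everything here is an assumption for illustration: the data is synthetic, the helper names `solve` and `adjusted_r2` are hypothetical, and adjusted R² is just one reasonable choice of criterion. It uses only the standard library so it runs as-is; in practice you would more likely reach for a statistics package.

```python
import itertools
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def adjusted_r2(columns, y):
    """Fit OLS with an intercept on the given predictor columns; return adjusted R^2."""
    n, p = len(y), len(columns)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))


# Synthetic, illustrative data (an assumption): only x1 and x2 drive y.
rng = random.Random(1)
n = 200
data = {f"x{j}": [rng.gauss(0, 1) for _ in range(n)] for j in range(1, 6)}
y = [2.0 * a - 1.5 * b + rng.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]

# Score every non-empty subset of the five inputs.
results = []
for size in range(1, len(data) + 1):
    for subset in itertools.combinations(data, size):
        score = adjusted_r2([data[name] for name in subset], y)
        results.append((score, subset))

results.sort(reverse=True)
for score, subset in results[:5]:
    print(f"{score:.4f}  {', '.join(subset)}")
```

Keep in mind that picking the winner from many candidate models on the same data is optimistic, so it is worth confirming the chosen subset on held-out data if you have enough observations.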