Solved – Correlation Between Input and Output Variable

Let's say I want to model a problem with 5 input variables and 1 output variable. I have measured the correlation of each input with the output. 3 of the 5 inputs have correlation less than 0.1, and the remaining 2 inputs have correlation greater than 0.7. Should I use all input variables in my model, or only the two with high correlation?

Play with it a bit. Run some scatterplots to get a feel for your data.

Since those two inputs are both so highly correlated with the output variable, there's a good chance they're also correlated with each other, possibly even more strongly. If so, putting both of them in the model introduces multicollinearity: the standard errors of their coefficients get inflated, so both may be declared insignificant even though each is a strong predictor on its own, and the coefficient estimates become unstable.
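A quick way to check this is to compute the correlation between the two inputs directly. The sketch below uses made-up data (the names `x1`, `x2`, `y` and the way they are generated are assumptions for illustration only, not the asker's data) and also reports the variance inflation factor, which for a model containing just these two predictors reduces to 1 / (1 - r^2).

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical data: x1 and x2 share a common driver, so both track y.
rng = random.Random(0)
common = [rng.gauss(0, 1) for _ in range(200)]
x1 = [c + rng.gauss(0, 0.3) for c in common]
x2 = [c + rng.gauss(0, 0.3) for c in common]
y = [c + rng.gauss(0, 0.5) for c in common]

r_x1_y = pearson(x1, y)
r_x2_y = pearson(x2, y)
r_x1_x2 = pearson(x1, x2)

# With exactly two predictors in the model, VIF = 1 / (1 - r^2),
# where r is the correlation between the two predictors.
vif = 1.0 / (1.0 - r_x1_x2 ** 2)

print(f"corr(x1, y)  = {r_x1_y:.2f}")
print(f"corr(x2, y)  = {r_x2_y:.2f}")
print(f"corr(x1, x2) = {r_x1_x2:.2f}")
print(f"VIF          = {vif:.2f}")
```

A VIF close to 1 means the two inputs carry mostly separate information; the larger it gets, the more the coefficient standard errors are inflated.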

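Since you're working with so few variables, try running an all-subsets regression: with 5 inputs there are only 31 non-empty combinations to fit. Then pick the best-performing combination, judged by something that penalizes extra variables, such as adjusted R² or cross-validated error, rather than raw R², which never goes down when you add a variable.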
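Here is a minimal sketch of all-subsets regression in plain Python. The data are synthetic, the variable names `x1` to `x5` are placeholders, and adjusted R² is only one possible ranking criterion (an assumption here; cross-validated error or an information criterion would work the same way).

```python
import itertools
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def adjusted_r2(columns, y):
    """Fit OLS with an intercept on the given columns; return adjusted R^2."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = k + 1
    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * y[i] for i, r in enumerate(rows)) for a in range(p)]
    beta = solve(XtX, Xty)
    mean_y = sum(y) / n
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical data: y depends on x1 and x2; x3..x5 are pure noise.
rng = random.Random(1)
n = 100
inputs = {"x1": [rng.gauss(0, 1) for _ in range(n)]}
inputs["x2"] = [0.6 * v + rng.gauss(0, 0.8) for v in inputs["x1"]]
for name in ("x3", "x4", "x5"):
    inputs[name] = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * a + 1.5 * b + rng.gauss(0, 1)
     for a, b in zip(inputs["x1"], inputs["x2"])]

# Fit every non-empty subset of the inputs and record its score.
results = []
names = list(inputs)
for k in range(1, len(names) + 1):
    for combo in itertools.combinations(names, k):
        score = adjusted_r2([inputs[name] for name in combo], y)
        results.append((score, combo))

best_score, best_subset = max(results)
print(f"{len(results)} models fitted")
print(f"best subset: {best_subset}, adjusted R^2 = {best_score:.3f}")
```

Adjusted R² can still let a noise variable slip into the winning subset by chance, so it's worth looking at the top few combinations rather than only the single best one.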
