Say I have a regression model that looks as follows. The goal is to predict credit card balance given a number of independent variables.
This is just the first pass at the model, and no attempt has yet been made to optimize it. I'm curious when the best time is to test for multicollinearity. Is it now, before we go any further, or should it happen after we've narrowed things down to what we think will be our final independent variables?
Best Answer
I don't think it matters much. Checking later saves you unnecessary work and frustration over transformations that turn out to be pointless if the variables don't make it into the final model. That said, running vif(model) takes almost no time, and you can always hold off on applying fixes for any multicollinearity until later.
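As a rough illustration of what a VIF check computes, here is a minimal sketch in base R. The data are simulated and the column names (balance, income, limit, age) are hypothetical stand-ins, not the asker's actual variables. The helper vif_manual is also made up for this example; in practice the vif(model) call mentioned above is presumably the one from the car package, which reports the same quantity for a fitted lm.

```r
# Simulated stand-in data; all column names are hypothetical.
set.seed(1)
n <- 200
income  <- rnorm(n, mean = 50, sd = 10)
limit   <- 100 * income + rnorm(n, sd = 50)  # nearly a linear function of income
age     <- rnorm(n, mean = 40, sd = 12)      # unrelated to the other two
balance <- 0.2 * limit - 3 * age + rnorm(n, sd = 100)
dat <- data.frame(balance, income, limit, age)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all the other predictors. vif_manual is a hypothetical helper.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(dat, c("income", "limit", "age"))
print(round(vifs, 2))
```

In this simulated setup, income and limit should come out with very large VIFs, while age should sit close to 1.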
The problem with multicollinearity is that it can distort the affected coefficients, changing their signs and their significance. The 'good' (or rather, convenient) thing about multicollinearity is that it affects only the collinear variables and leaves the rest of the variables alone. This means that if the collinearity exists only among control variables, it is often OK to disregard it.
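The claim that only the collinear variables are affected can be checked with a small simulation. The sketch below assumes made-up variables x1, x2 (almost a copy of x1) and z (independent of both), and compares coefficient standard errors with and without the redundant predictor.

```r
# Simulated data; x1, x2, z and y are hypothetical names.
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1
z  <- rnorm(n)                  # independent of x1 and x2
y  <- 1 + 2 * x1 + 1 * z + rnorm(n)

fit_collinear <- lm(y ~ x1 + x2 + z)  # includes the redundant predictor
fit_reduced   <- lm(y ~ x1 + z)       # drops it

# Pull the standard-error column out of the coefficient table.
se <- function(fit) coef(summary(fit))[, "Std. Error"]
se_collinear <- se(fit_collinear)
se_reduced   <- se(fit_reduced)

print(round(se_collinear, 3))
print(round(se_reduced, 3))
```

The standard error for x1 should be many times larger in the collinear fit, while the one for z should be essentially the same in both.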
So check whether the collinearity is among the controls. If it is, go ahead and optimize, and leave it alone. If it involves the main explanatory variables, deal with it now, before optimizing. A common way is centering, which can be done using scale(var_to_scale, scale = FALSE).
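Here is a small sketch of what centering with scale() does. It assumes a hypothetical predictor x that also enters the model as a squared term; centering mainly helps with this kind of collinearity between a variable and its own polynomial or interaction terms, and generally does not change the correlation between two distinct predictors.

```r
# Hypothetical raw predictor whose values sit far from zero.
set.seed(7)
x <- runif(300, min = 20, max = 80)

# scale(..., scale = FALSE) subtracts the mean but keeps the original units.
# It returns a one-column matrix, so convert back to a plain vector.
x_centered <- as.numeric(scale(x, scale = FALSE))

# Correlation between the variable and its square, before and after centering.
cor_raw      <- cor(x, x^2)
cor_centered <- cor(x_centered, x_centered^2)
print(c(raw = cor_raw, centered = cor_centered))
```

With these simulated values, the raw correlation should be close to 1 and the centered one close to 0.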
Edit: the answer by @user3640761 raises a valid suggestion: check for high correlations in your data before doing anything else. It's easy, fast, and can give a good indication.
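A quick correlation screen along those lines might look like the sketch below. The data are simulated, the column names are hypothetical, and the 0.8 cutoff is an arbitrary assumption rather than a standard threshold. Keep in mind that pairwise correlations can miss collinearity that involves three or more variables at once, which is where VIF is more informative.

```r
# Simulated stand-in predictors; names are hypothetical.
set.seed(3)
n <- 200
income <- rnorm(n, mean = 50, sd = 10)
limit  <- 100 * income + rnorm(n, sd = 50)  # strongly tied to income
age    <- rnorm(n, mean = 40, sd = 12)
predictors <- data.frame(income, limit, age)

cors <- cor(predictors)
cors[upper.tri(cors, diag = TRUE)] <- NA  # keep each pair only once

# Flag pairs above an assumed cutoff of |r| > 0.8; which() skips the NAs.
high <- which(abs(cors) > 0.8, arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
print(flagged)
```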