Solved – Sign change of a coefficient in logistic regression?

I am running a logistic regression with 5 continuous independent variables (IVs). The problem is that IV4, when taken alone, has a positive association with the outcome (coefficient > 0), but when entered together with the other variables it has a negative one (coefficient < 0). I evaluated the pairwise correlations between IV4 and the other variables, and the results are:
IV4 vs. IV1 (-0.51), IV4 vs. IV2 (-0.48), IV4 vs. IV3 (0.61) and IV4 vs. IV5 (0.73).

I ran further logistic regressions, eliminating the other variables one at a time to see whether one of them was responsible for the sign change, and I noticed that when IV1 was removed, the coefficient of IV4 became positive.

Thus, it seems that IV1 changes the sign of the coefficient of IV4.
Does anyone know what the cause might be and (possibly) the solution?

Practically, do I have to eliminate IV4 (or IV1) from the model and explain why?

Thanks a lot for answering

Leonardo Frazzoni, MD

JohnK is correct that collinearity is the leading contender in explaining the wrong sign. Unfortunately, you can't learn much about collinearity from the zero-order (pairwise) correlations alone, since the issues run deeper and are to be found in the partial or semi-partial correlations conditional on the full set of model inputs.
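The sign reversal you describe is easy to reproduce in a small simulation. The sketch below (hypothetical data, not your data set: a fitted `fit_logistic` helper and coefficients chosen only for illustration) builds two negatively correlated IVs, with the true effect of IV4 negative; marginally, IV4 nevertheless looks positive, because it partly stands in for IV1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# IV1 and IV4 negatively correlated, roughly matching the r ≈ -0.5 in the question.
iv1 = rng.normal(size=n)
iv4 = -0.6 * iv1 + 0.8 * rng.normal(size=n)

# True model: IV4's effect is negative, but IV1's strong negative effect combined
# with the negative IV1-IV4 correlation makes IV4 look positive on its own.
logits = -2.0 * iv1 - 0.5 * iv4
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

def fit_logistic(X, y, iters=25):
    """Plain Newton-Raphson maximum likelihood for logistic regression (with intercept)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ beta))
        w = p * (1 - p)
        beta += np.linalg.solve(Xd.T @ (w[:, None] * Xd), Xd.T @ (y - p))
    return beta

b_alone = fit_logistic(iv4[:, None], y)                  # IV4 by itself
b_joint = fit_logistic(np.column_stack([iv1, iv4]), y)   # IV1 and IV4 together

print(b_alone[1])   # positive: IV4 alone picks up IV1's omitted effect
print(b_joint[2])   # negative: close to the true -0.5 once IV1 is included
```

The marginal coefficient of IV4 absorbs the omitted IV1 effect through their correlation, which is exactly the mechanism behind a "wrong sign" under collinearity.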

There is a plethora of methods available to diagnose the causes of collinearity, ranging from VIFs (variance inflation factors) and residual analyses to eigenvalue decompositions. These multiple regression diagnostics have been widely available ever since Belsley, Kuh and Welsch's 1980 book Regression Diagnostics. They have their logistic regression analogue in a 1981 paper by Daryl Pregibon titled simply Logistic Regression Diagnostics. However (and this is the sort of thing that makes the more rigidly orthodox academic types cringe), a suitable workaround is to apply the more readily available MR diagnostics to your data before running your LR.
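A VIF is just 1 / (1 - R²) from regressing each IV on all the others, so it needs nothing beyond an ordinary least-squares solver. A minimal sketch (hypothetical data; the `vifs` helper is mine, not from any of the books cited):

```python
import numpy as np

def vifs(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])   # intercept + other IVs
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

# Hypothetical data: x2 is nearly a copy of x1, so both get large VIFs.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)
v = vifs(np.column_stack([x1, x2, x3]))
print(v)   # first two entries far above the common rule-of-thumb cutoff of 10
```

Note that, unlike the pairwise correlations in the question, a VIF conditions on all the other IVs at once, which is why it can flag collinearity the correlation matrix misses.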

There are those who object to parameter-level diagnostics, arguing that the issue lies at the level of the overall model and that, unless the standard errors are huge, the resulting model should be the focus: does it make theoretical sense? These objections tend to come from fields where causal explanations are sought and desired (e.g., political science; see Understanding Interaction Models: Improving Empirical Analyses by Brambor, Clark and Golder, 2006). Bear in mind that, with these prescriptions, they are also standing nearly 30 years of research into regression diagnostics on its head. Moreover, these prescriptions are inconsistent with (and/or are lightly documented compared to, as in the Brambor et al. example) much more rigorously documented conclusions based on earlier research (e.g., see Multiple Regression: Testing and Interpreting Interactions by Aiken and West, particularly with respect to regression model interactions), or are guilty of thoughtlessly repeating erroneous myths about the causes of collinearity (e.g., Brambor et al. state that collinearity is a "small sample" data problem).

So, do you drop the offending IV1 or IV4? Or, following Brambor et al., check the overall model properties and, assuming the causal and theoretical appropriateness of the fully specified model, do nothing at all? Since I clearly do not subscribe to Brambor et al.'s nostrums, I would elect to drop one of the two IVs. For me, the choice has always been: which one adds greater absolute relative importance to the model? In your case, with only two candidate IVs (IV1 and IV4), this is an easier choice and can be based on standardized metrics such as the t- or chi-squared values in the full model. The IV with the larger absolute value would be retained.
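That comparison can be sketched as follows: fit the full model, compute Wald z-values (coefficient divided by its standard error, from the inverse Hessian), and keep whichever of the two collinear IVs has the larger absolute z. Again hypothetical data and a helper of my own, not your fitted model:

```python
import numpy as np

def logistic_fit_with_z(X, y, iters=25):
    """Newton MLE for logistic regression, plus Wald z-values (beta / s.e.)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ beta))
        w = p * (1 - p)
        H = Xd.T @ (w[:, None] * Xd)                 # observed information
        beta += np.linalg.solve(H, Xd.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))          # standard errors from inverse Hessian
    return beta, beta / se

# Hypothetical collinear pair: IV1 carries the stronger effect by construction.
rng = np.random.default_rng(2)
n = 5000
iv1 = rng.normal(size=n)
iv4 = -0.6 * iv1 + 0.8 * rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(2.0 * iv1 + 0.5 * iv4)))

beta, z = logistic_fit_with_z(np.column_stack([iv1, iv4]), y)
keep = "IV1" if abs(z[1]) > abs(z[2]) else "IV4"
print(keep)   # IV1 here, since it has the larger |z| in the full model
```

The squares of these z-values are the single-degree-of-freedom Wald chi-squared statistics that most LR software reports, so the rule is the same whichever scale your package prints.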
