I get enormous coefficients in a logistic regression; see the coefficients involving krajULKV:
> summary(m5)

Call:
glm(formula = cbind(ml, ad) ~ rok + obdobi + kraj + resid_usili2 +
    rok:obdobi + rok:kraj + obdobi:kraj + kraj:resid_usili2 +
    rok:obdobi:kraj, family = "quasibinomial")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7796  -1.0958  -0.3101   1.0034   2.8370

Coefficients:
                              Estimate      Std. Error t value Pr(>|t|)
(Intercept)                 -486.72087       664.71911  -0.732  0.46424
rok                            0.24232         0.33114   0.732  0.46452
obdobinehn                  3400.43703      1354.14874   2.511  0.01223 *
krajJHC                      786.22409       708.50291   1.110  0.26746
krajJHM                      511.85538       823.03038   0.622  0.53417
krajLBK                      -23.94180      2388.86316  -0.010  0.99201
krajMSK                     1281.88767       955.09736   1.342  0.17992
krajOLK                     -175.19425      1255.82946  -0.140  0.88909
krajPAK                      349.76438      1071.03364   0.327  0.74408
krajPLK                    -1335.73206      1534.09899  -0.871  0.38418
krajSTC                      868.99157       692.30426   1.255  0.20976
krajULKV                  245661.86828  17496742.31677   0.014  0.98880
krajVYS                     3341.76686      1314.77140   2.542  0.01121 *
krajZLK                     3950.75617      2922.25220   1.352  0.17676
resid_usili2                  -1.44719         0.89315  -1.620  0.10555
rok:obdobinehn                -1.69479         0.67462  -2.512  0.01219 *
rok:krajJHC                   -0.39108         0.35295  -1.108  0.26817
rok:krajJHM                   -0.25481         0.40997  -0.622  0.53443
rok:krajLBK                    0.01621         1.19155   0.014  0.98915
rok:krajMSK                   -0.63985         0.47592  -1.344  0.17917
rok:krajOLK                    0.08714         0.62545   0.139  0.88923
rok:krajPAK                   -0.17419         0.53344  -0.327  0.74410
rok:krajPLK                    0.66539         0.76383   0.871  0.38394
rok:krajSTC                   -0.43292         0.34490  -1.255  0.20976
rok:krajULKV                -122.01076      8704.03367  -0.014  0.98882
rok:krajVYS                   -1.66391         0.65468  -2.542  0.01122 *
rok:krajZLK                   -1.96718         1.45474  -1.352  0.17667
obdobinehn:krajJHC         -3623.86807      1385.86009  -2.615  0.00909 **
obdobinehn:krajJHM         -3220.08906      1458.83842  -2.207  0.02757 *
obdobinehn:krajLBK         -1051.07131      3434.11845  -0.306  0.75963
obdobinehn:krajMSK         -6415.65781      1978.30260  -3.243  0.00123 **
obdobinehn:krajOLK         -2427.66591      1777.51914  -1.366  0.17239
obdobinehn:krajPAK         -3111.45312      1623.59145  -1.916  0.05566 .
obdobinehn:krajPLK         -1800.26258      2065.74461  -0.871  0.38375
obdobinehn:krajSTC         -4409.45624      1379.64196  -3.196  0.00145 **
obdobinehn:krajULKV      -187832.68360  16454272.74951  -0.011  0.99089
obdobinehn:krajVYS         -5445.51446      1791.38012  -3.040  0.00244 **
obdobinehn:krajZLK         -6216.43343      3167.49836  -1.963  0.05003 .
krajJHC:resid_usili2           1.60474         0.98554   1.628  0.10385
krajJHM:resid_usili2           1.57822         1.04518   1.510  0.13143
krajLBK:resid_usili2          11.53462        13.40012   0.861  0.38961
krajMSK:resid_usili2          -1.33600         1.55241  -0.861  0.38971
krajOLK:resid_usili2           0.07296         1.27034   0.057  0.95421
krajPAK:resid_usili2           1.35880         1.23033   1.104  0.26974
krajPLK:resid_usili2           1.90189         1.41163   1.347  0.17826
krajSTC:resid_usili2           2.05237         0.95972   2.139  0.03277 *
krajULKV:resid_usili2        599.79215     20568.86123   0.029  0.97674
krajVYS:resid_usili2           3.03834         1.16464   2.609  0.00925 **
krajZLK:resid_usili2           1.18574         1.11024   1.068  0.28583
rok:obdobinehn:krajJHC         1.80611         0.69042   2.616  0.00906 **
rok:obdobinehn:krajJHM         1.60475         0.72676   2.208  0.02751 *
rok:obdobinehn:krajLBK         0.52268         1.71244   0.305  0.76027
rok:obdobinehn:krajMSK         3.19712         0.98564   3.244  0.00123 **
rok:obdobinehn:krajOLK         1.21012         0.88541   1.367  0.17208
rok:obdobinehn:krajPAK         1.55034         0.80886   1.917  0.05563 .
rok:obdobinehn:krajPLK         0.89718         1.02893   0.872  0.38349
rok:obdobinehn:krajSTC         2.19742         0.68732   3.197  0.00144 **
rok:obdobinehn:krajULKV       93.43130      8189.24994   0.011  0.99090
rok:obdobinehn:krajVYS         2.71357         0.89236   3.041  0.00243 **
rok:obdobinehn:krajZLK         3.09624         1.57711   1.963  0.04996 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 1.258421)

    Null deviance: 1518.0  on 878  degrees of freedom
Residual deviance: 1228.6  on 819  degrees of freedom
  (465 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 18
What does this mean? Is it multicollinearity, as @Scortchi mentioned in this discussion? Or is it overfitting? How can I detect the problem, and what should I do now?
I tried removing some variables. This helps a bit, but not much:
> m6 <- update(m5, ~.- kraj:resid_usili2)
> m7 <- update(m6, ~.- resid_usili2)
> summary(m7)

Call:
glm(formula = cbind(ml, ad) ~ rok + obdobi + kraj + rok:obdobi +
    rok:kraj + obdobi:kraj + rok:obdobi:kraj, family = "quasibinomial")

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.9098  -1.1931  -0.2274   1.0529   3.1283

Coefficients:
                            Estimate  Std. Error t value Pr(>|t|)
(Intercept)               -118.95199   476.34698  -0.250    0.803
rok                          0.05971     0.23718   0.252    0.801
obdobinehn                 412.69412   646.95083   0.638    0.524
krajJHC                    447.69791   498.45358   0.898    0.369
krajJHM                    -62.92516   525.85737  -0.120    0.905
krajLBK                    677.73239  1595.20024   0.425    0.671
krajMSK                    278.24639   621.32312   0.448    0.654
krajOLK                   -705.97832   782.53474  -0.902    0.367
krajPAK                    387.96543   608.98961   0.637    0.524
krajPLK                   -653.68419   782.20737  -0.836    0.403
krajSTC                   -114.34822   489.06318  -0.234    0.815
krajULKV                 -2117.64674  1797.75836  -1.178    0.239
krajVYS                    884.74411   681.05324   1.299    0.194
krajZLK                   -997.77613   925.93280  -1.078    0.281
rok:obdobinehn              -0.20602     0.32211  -0.640    0.523
rok:krajJHC                 -0.22303     0.24819  -0.899    0.369
rok:krajJHM                  0.03092     0.26180   0.118    0.906
rok:krajLBK                 -0.33909     0.79438  -0.427    0.670
rok:krajMSK                 -0.13889     0.30935  -0.449    0.654
rok:krajOLK                  0.35102     0.38943   0.901    0.368
rok:krajPAK                 -0.19392     0.30323  -0.640    0.523
rok:krajPLK                  0.32463     0.38937   0.834    0.405
rok:krajSTC                  0.05677     0.24351   0.233    0.816
rok:krajULKV                 1.05287     0.89453   1.177    0.239
rok:krajVYS                 -0.44149     0.33911  -1.302    0.193
rok:krajZLK                  0.49612     0.46081   1.077    0.282
obdobinehn:krajJHC        -776.31258   672.68911  -1.154    0.249
obdobinehn:krajJHM        -267.78650   700.38741  -0.382    0.702
obdobinehn:krajLBK       -1246.67321  1760.37329  -0.708    0.479
obdobinehn:krajMSK        -383.77613   858.81391  -0.447    0.655
obdobinehn:krajOLK         -96.72334   947.75189  -0.102    0.919
obdobinehn:krajPAK        -540.25140   827.13134  -0.653    0.514
obdobinehn:krajPLK        -517.49161  1124.63474  -0.460    0.645
obdobinehn:krajSTC        -683.81160   672.66674  -1.017    0.310
obdobinehn:krajULKV       2344.32314  2073.98366   1.130    0.259
obdobinehn:krajVYS        -795.62043   917.80551  -0.867    0.386
obdobinehn:krajZLK         618.33075  1093.37768   0.566    0.572
rok:obdobinehn:krajJHC       0.38725     0.33493   1.156    0.248
rok:obdobinehn:krajJHM       0.13374     0.34870   0.384    0.701
rok:obdobinehn:krajLBK       0.62237     0.87662   0.710    0.478
rok:obdobinehn:krajMSK       0.19114     0.42758   0.447    0.655
rok:obdobinehn:krajOLK       0.04842     0.47171   0.103    0.918
rok:obdobinehn:krajPAK       0.26922     0.41184   0.654    0.513
rok:obdobinehn:krajPLK       0.25790     0.55986   0.461    0.645
rok:obdobinehn:krajSTC       0.34078     0.33492   1.017    0.309
rok:obdobinehn:krajULKV     -1.16571     1.03236  -1.129    0.259
rok:obdobinehn:krajVYS       0.39675     0.45704   0.868    0.386
rok:obdobinehn:krajZLK      -0.30732     0.54422  -0.565    0.572

(Dispersion parameter for quasibinomial family taken to be 1.313286)

    Null deviance: 2396.8  on 1343  degrees of freedom
Residual deviance: 2110.3  on 1296  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
EDIT: As Scortchi suggested, I computed VIFs, and I get enormous values there too. What does this mean? See:
> require(HH)
> vif(cbind(ml, ad) ~ rok + obdobi + kraj + resid_usili2 +
+     rok:obdobi + rok:kraj + obdobi:kraj + kraj:resid_usili2 +
+     rok:obdobi:kraj)
            rok      obdobinehn         krajJHC         krajJHM
      50.281603 45075363.969712 15194580.406796 11362184.620230
        krajLBK         krajMSK         krajOLK         krajPAK
 7567915.376763  5228018.864051 17105623.986998 10944471.683601
[... cut out ...]
Best Answer
I would suggest that the massive coefficients, and the correspondingly massive standard errors, are almost certainly caused by quasi-complete or complete separation. That is, for some combination of predictor values, either every observation had the outcome or none did, and so the coefficient heads towards infinity (or negative infinity).
This tends to happen especially when one specifies a lot of interaction terms, because each added interaction raises the chance that some combination of factor levels forms an "empty" cell (a cell in which nobody has the outcome, or everybody does).
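As a minimal sketch of what separation does to a fit, the toy data below (entirely made up for illustration, not the poster's data) has one factor level, "C", in which every trial is a success. The names `d`, `g`, `y` and `fit` are assumptions of this example.

```r
# Toy data (hypothetical): three groups with 20 binary trials each.
# Groups A and B have mixed outcomes; group C has successes only,
# so level C is quasi-completely separated.
d <- data.frame(
  g = factor(rep(c("A", "B", "C"), each = 20)),
  y = c(rep(c(1, 0), c(8, 12)),   # A: 8 successes, 12 failures
        rep(c(1, 0), c(12, 8)),   # B: 12 successes, 8 failures
        rep(1, 20))               # C: 20 successes, 0 failures
)

# glm() usually warns here that fitted probabilities of 0 or 1 occurred;
# the warning is suppressed only to keep the output short.
fit <- suppressWarnings(glm(y ~ g, family = binomial, data = d))

# The gC row shows a large estimate with a much larger standard error,
# while the gB row looks ordinary.
print(coef(summary(fit)))

# Separated fits also tend to need many Fisher scoring iterations.
print(fit$iter)
```

The same pattern is visible in the question: the krajULKV terms in m5 have estimates and standard errors several orders of magnitude larger than the rest, and m5 needed 18 Fisher scoring iterations against 5 for m7.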
See the following page for some further details and suggested strategies (link updated March 2021): https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-complete-or-quasi-complete-separation-in-logisticprobit-regression-and-how-do-we-deal-with-them/
More generally, it means that you're probably trying to do "too much" with your model for the size of your dataset (particularly the number of outcomes observed).
EDIT: A couple of pragmatic suggestions
You might try (1) quick and simple: dropping the interaction terms from your model to see if that helps (whether this makes sense from a research question perspective is an entirely different issue); or (2) getting R to make you a bi-i-i-i-g contingency table with the combinations of factor levels described by the interactions as rows and the outcome variable as columns. You might be able to see some evidence of separation there.
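Suggestion (2) could look roughly like the following sketch. The data frame `dat` and its values are hypothetical stand-ins (the question does not show the real data), the level name "hn" for obdobi is a guess, and the column names are borrowed from the model formula, where the response is given as counts via cbind(ml, ad).

```r
# Hypothetical stand-in for the real data: one row per observation,
# with success/failure counts `ml` and `ad` as in cbind(ml, ad).
dat <- data.frame(
  kraj   = factor(c("JHC", "JHC", "ULKV", "ULKV", "VYS", "VYS")),
  obdobi = factor(c("hn", "nehn", "hn", "nehn", "hn", "nehn")),
  ml     = c(5, 7, 0, 3, 4, 6),
  ad     = c(6, 4, 9, 0, 5, 5)
)

# Sum the two outcome counts within each kraj x obdobi cell.
cells <- aggregate(cbind(ml, ad) ~ kraj + obdobi, data = dat, FUN = sum)
print(cells)

# Cells in which one of the two outcomes never occurs are the
# candidates for separation.
suspect <- cells[cells$ml == 0 | cells$ad == 0, ]
print(suspect)
```

This only checks the factor-by-factor cells. Because rok enters the model as a numeric covariate, separation could also arise along rok within a cell, so a clean table would not rule the problem out.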