Solved – Forward and backward stepwise regression (AIC) for negative binomial regression (with real data)

I am doing some count data analysis. The data is in this link.

Column A is the count data, and other columns are the independent variables. At first I used Poisson regression to analyze it:

``m0<-glm(A~.,data=d,family="poisson") summary(m0) ``

We see that the residual deviance is greater than the degrees of freedom so that we have over-dispersion:

``Call: glm(formula = A ~ ., family = "poisson", data = d)  Deviance Residuals:       Min        1Q    Median        3Q       Max   -28.8979   -4.5110    0.0384    5.4327   20.3809    Coefficients:               Estimate Std. Error z value Pr(>|z|)     (Intercept)  8.7054842  0.9100882   9.566  < 2e-16 *** B           -0.1173783  0.0172330  -6.811 9.68e-12 *** C            0.0864118  0.0182549   4.734 2.21e-06 *** D            0.1169891  0.0301960   3.874 0.000107 *** E            0.0738377  0.0098131   7.524 5.30e-14 *** F            0.3814588  0.0093793  40.670  < 2e-16 *** G           -0.3712263  0.0274347 -13.531  < 2e-16 *** H           -0.0694672  0.0022137 -31.380  < 2e-16 *** I           -0.0634488  0.0034316 -18.490  < 2e-16 *** J           -0.0098852  0.0064538  -1.532 0.125602     K           -0.1105270  0.0128016  -8.634  < 2e-16 *** L           -0.3304606  0.0155454 -21.258  < 2e-16 *** M            0.2274175  0.0259872   8.751  < 2e-16 *** N            0.2922063  0.0174406  16.754  < 2e-16 *** O            0.1179708  0.0119332   9.886  < 2e-16 *** P            0.0618776  0.0260646   2.374 0.017596 *   Q           -0.0303909  0.0060060  -5.060 4.19e-07 *** R           -0.0018939  0.0037642  -0.503 0.614864     S            0.0383040  0.0065841   5.818 5.97e-09 *** T            0.0318111  0.0116611   2.728 0.006373 **  U            0.2421129  0.0145502  16.640  < 2e-16 *** V            0.1782144  0.0090858  19.615  < 2e-16 *** W           -0.5105135  0.0258136 -19.777  < 2e-16 *** X           -0.0583590  0.0043641 -13.373  < 2e-16 *** Y           -0.1554609  0.0042604 -36.489  < 2e-16 *** Z            0.0064478  0.0001184  54.459  < 2e-16 *** AA           0.3880479  0.0164929  23.528  < 2e-16 *** AB           0.1511362  0.0050471  29.945  < 2e-16 *** AC           0.0557880  0.0181129   3.080 0.002070 **  AD          -0.6569099  0.0368771 -17.813  < 2e-16 *** AE          -0.0040679  0.0003960 -10.273  < 2e-16 *** --- Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  (Dispersion parameter for poisson family taken to be 1)      Null deviance: 97109.0  on 56  degrees of freedom Residual deviance:  5649.7  on 26  degrees of freedom AIC: 6117.1  Number of Fisher Scoring iterations: 6 ``

Then I think I should use negative binomial regression for the over-dispersion data. Since you can see I have many independent variables, and I wanted to select the important variables. And I decide to use stepwise regression to select the independent variable. At first, I create a full model:

``full.model <- glm.nb(A~., data=d,maxit=1000) # when not indicating maxit, or maxit=100, it shows Warning messages: 1: glm.fit: algorithm did not converge; 2: In glm.nb(A ~ ., data = d, maxit = 100) : alternation limit reached ``

When indicating `maxit=1000`, the warning message disappears.

``summary(full.model)  Call: glm.nb(formula = A ~ ., data = d, maxit = 1000, init.theta = 2.730327193,      link = log)  Deviance Residuals:      Min       1Q   Median       3Q      Max   -2.5816  -0.8893  -0.3177   0.4882   1.9073    Coefficients:               Estimate Std. Error z value Pr(>|z|)    (Intercept) 11.8228596  8.3004322   1.424  0.15434    B           -0.2592324  0.1732782  -1.496  0.13464    C            0.2890696  0.1928685   1.499  0.13393    D            0.3136262  0.3331182   0.941  0.34646    E            0.3764257  0.1313142   2.867  0.00415 ** F            0.3257785  0.1448082   2.250  0.02447 *  G           -0.7585881  0.2343529  -3.237  0.00121 ** H           -0.0714660  0.0343683  -2.079  0.03758 *  I           -0.1050681  0.0357237  -2.941  0.00327 ** J            0.0810292  0.0566905   1.429  0.15291    K            0.2582978  0.1574582   1.640  0.10092    L           -0.2009784  0.1543773  -1.302  0.19296    M           -0.2359658  0.3216941  -0.734  0.46325    N           -0.0689036  0.1910518  -0.361  0.71836    O            0.0514983  0.1383610   0.372  0.70974    P            0.1843138  0.3253483   0.567  0.57105    Q            0.0198326  0.0509651   0.389  0.69717    R            0.0892239  0.0459729   1.941  0.05228 .  S           -0.0430981  0.0856391  -0.503  0.61479    T            0.2205653  0.1408009   1.567  0.11723    U            0.2450243  0.1838056   1.333  0.18251    V            0.1253683  0.0888411   1.411  0.15820    W           -0.4636739  0.2348172  -1.975  0.04831 *  X           -0.0623290  0.0508299  -1.226  0.22011    Y           -0.0939878  0.0606831  -1.549  0.12142    Z            0.0019530  0.0015143   1.290  0.19716    AA          -0.2888123  0.2449085  -1.179  0.23829    AB           0.1185890  0.0696343   1.703  0.08856 .  AC          -0.3401963  0.2047698  -1.661  0.09664 .  AD          -1.3409002  0.4858741  -2.760  0.00578 ** AE          -0.0006299  0.0051338  -0.123  0.90234    --- Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  (Dispersion parameter for Negative Binomial(2.7303) family taken to be 1)      Null deviance: 516.494  on 56  degrees of freedom Residual deviance:  61.426  on 26  degrees of freedom AIC: 790.8  Number of Fisher Scoring iterations: 1                 Theta:  2.730            Std. Err.:  0.537    2 x log-likelihood:  -726.803  ``

When not indicating `maxit`, or `maxit=100`, it shows

Warning messages: 1: glm.fit: algorithm did not converge; 2: In glm.nb(A ~ ., data = d, maxit = 100) : alternation limit reached.

When indicating `maxit=1000`, the warning message disappears.

Then I create a first model:

``first.model <- glm.nb(A ~ 1, data = d) ``

Then I tried the forward stepwise regression:

``step.model <- step(first.model, direction="forward", scope=formula(full.model)) ``

Error in glm.fit(X, y, wt, offset = offset, family = object\$family, control = object\$control) :
NA/NaN/Inf in 'x'
step size truncated due to divergence

What is the problem?

I also tried the backward regression:

``step.model2 <- step(full.model,direction="backward")  #the final step Step:  AIC=770.45 A ~ B + C + E + F + G + H + I + K + L + R + T + V + W + Y + AA +      AB + AD         Df Deviance    AIC <none>      62.375 770.45 - AB    1   64.859 770.93 - H     1   65.227 771.30 - V     1   65.240 771.31 - L     1   65.291 771.36 - Y     1   65.831 771.90 - B     1   66.051 772.12 - C     1   67.941 774.01 - AA    1   69.877 775.95 - K     1   70.411 776.48 - W     1   71.526 777.60 - I     1   71.863 777.94 - E     1   72.338 778.41 - G     1   73.344 779.42 - F     1   73.510 779.58 - AD    1   79.620 785.69 - R     1   80.358 786.43 - T     1   95.725 801.80 Warning messages: 1: glm.fit: algorithm did not converge  2: glm.fit: algorithm did not converge  3: glm.fit: algorithm did not converge  4: glm.fit: algorithm did not converge  ``

My question is: Why it is different in using forward and backward stepwise regression? And why do I get the error message when performing forward selection? Also, what exactly do these warning messages mean? And how should I deal with it?

I am not a stats person but need to conduct statical analysis for my research data. So I am struggling in learning how to do different regression analyses using real data. I searched online for similar questions but I still could understand … And please let me know if I did anything wrong in my regression analysis. I would really appreciate it if you could help me with these questions!

Contents

I have good news and bad news.

good news

• you can probably more or less disregard the warnings. Where stepwise regression is recommended at all (see below …), backward regression is probably better than forward regression anyway.
• you can do forward and backward stepwise regression with `MASS::stepAIC()` (instead of `step`).

• `step` probably isn't doing what you think it's doing anyway. Rather than refitting the negative binomial dispersion parameter, it's re-fitting with a fixed overdispersion parameter, which is probably not what you want (there's a classically snarky e-mail from Prof. Brian Ripley from 2006 here that discusses this issue in passing). As mentioned above, `stepAIC()` works better.
• if you are only interested in predictive accuracy, and not in anything about confidence intervals or hypothesis tests or measuring variable importance … then stepwise regression might be OK (Murtaugh 2009) …
• but if you care at all about being able to make any inferences about the effects of the parameters, you have too many variables and not enough data. A rule of thumb is that (1) you need at least 10 times as many data points as predictor variables to do reliable inference and (2) doing any inference after selecting variables (via stepwise selection or otherwise) is very wrong [unless you do super-cutting-edge stuff that only works with huge data sets and very strong assumptions].

The big question here is: why do you want to do variable selection in the first place?

• you're only interested in prediction: OK, but something like penalized regression (Dahlgren 2010) will probably work better
• you're interested in inference: this is going to be tough; you almost certainly don't have enough data to tell the effects of correlated variables apart. In your situation I would probably compute the principal components (PCA) of the predictor variables and use only the first 5 (which fall within the $$n/10$$ rule, and explain 99.5% of the variance in the predictors …)

Murtaugh, Paul A. “Performance of Several Variable-Selection Methods Applied to Real Ecological Data.” Ecology Letters 12, no. 10 (October 2009): 1061–68. https://doi.org/10.1111/j.1461-0248.2009.01361.x.

Dahlgren, Johan P. “Alternative Regression Methods Are Not Considered in Murtaugh (2009) or by Ecologists in General.” Ecology Letters 13, no. 5 (May 1, 2010): E7–9. https://doi.org/10.1111/j.1461-0248.2010.01460.x.

Rate this post