Solved – Linear regression lm or stepwise regression here using R

This is a basic question, but I could not find a clear answer in my reading. I am trying to find independent predictors of Infant.Mortality in the data frame 'swiss' in R.

    > head(swiss)
                 Fertility Agriculture Examination Education Catholic Infant.Mortality
    Courtelary        80.2        17.0          15        12     9.96             22.2
    Delemont          83.1        45.1           6         9    84.84             22.2
    Franches-Mnt      92.5        39.7           5         5    93.40             20.2
    Moutier           85.8        36.5          12         7    33.77             20.3
    Neuveville        76.9        43.5          17        15     5.16             20.6
    Porrentruy        76.1        35.3           9         7    90.57             26.6

Following are the results using lm and I find only Fertility to be a significant predictor:

    > fit = lm(Infant.Mortality~., data=swiss)
    > summary(fit)

    Call:
    lm(formula = Infant.Mortality ~ ., data = swiss)

    Residuals:
        Min      1Q  Median      3Q     Max
    -8.2512 -1.2860  0.1821  1.6914  6.0937

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8.667e+00  5.435e+00   1.595  0.11850
    Fertility    1.510e-01  5.351e-02   2.822  0.00734    #  <<<< NOTE P VALUE HERE
    Agriculture -1.175e-02  2.812e-02  -0.418  0.67827
    Examination  3.695e-02  9.607e-02   0.385  0.70250
    Education    6.099e-02  8.484e-02   0.719  0.47631
    Catholic     6.711e-05  1.454e-02   0.005  0.99634

    Residual standard error: 2.683 on 41 degrees of freedom
    Multiple R-squared:  0.2439,    Adjusted R-squared:  0.1517
    F-statistic: 2.645 on 5 and 41 DF,  p-value: 0.03665

Following are the graphs:

plot(fit) 

[Image: diagnostic plots from plot(fit)]

On performing stepwise regression, following are the results:

    > step <- stepAIC(fit, direction="both");
    Start:  AIC=98.34
    Infant.Mortality ~ Fertility + Agriculture + Examination + Education +
        Catholic

                  Df Sum of Sq    RSS     AIC
    - Catholic     1     0.000 295.07  96.341
    - Examination  1     1.065 296.13  96.511
    - Agriculture  1     1.256 296.32  96.541
    - Education    1     3.719 298.79  96.930
    <none>                     295.07  98.341
    - Fertility    1    57.295 352.36 104.682

    Step:  AIC=96.34
    Infant.Mortality ~ Fertility + Agriculture + Examination + Education

                  Df Sum of Sq    RSS     AIC
    - Examination  1     1.320 296.39  94.551
    - Agriculture  1     1.395 296.46  94.563
    - Education    1     5.774 300.84  95.252
    <none>                     295.07  96.341
    + Catholic     1     0.000 295.07  98.341
    - Fertility    1    72.609 367.68 104.681

    Step:  AIC=94.55
    Infant.Mortality ~ Fertility + Agriculture + Education

                  Df Sum of Sq    RSS     AIC
    - Agriculture  1     4.250 300.64  93.220
    - Education    1     6.875 303.26  93.629
    <none>                     296.39  94.551
    + Examination  1     1.320 295.07  96.341
    + Catholic     1     0.255 296.13  96.511
    - Fertility    1    79.804 376.19 103.758

    Step:  AIC=93.22
    Infant.Mortality ~ Fertility + Education

                  Df Sum of Sq    RSS     AIC
    <none>                     300.64  93.220
    - Education    1    21.902 322.54  94.525
    + Agriculture  1     4.250 296.39  94.551
    + Examination  1     4.175 296.46  94.563
    + Catholic     1     2.318 298.32  94.857
    - Fertility    1    85.769 386.41 103.017

    > step$anova
    Stepwise Model Path
    Analysis of Deviance Table

    Initial Model:
    Infant.Mortality ~ Fertility + Agriculture + Examination + Education +
        Catholic

    Final Model:
    Infant.Mortality ~ Fertility + Education

               Step Df     Deviance Resid. Df Resid. Dev      AIC
    1                                      41   295.0662 98.34145
    2    - Catholic  1 0.0001533995        42   295.0663 96.34147
    3 - Examination  1 1.3199421028        43   296.3863 94.55125
    4 - Agriculture  1 4.2499886025        44   300.6363 93.22041

The summary shows that Education also has a trend towards a significant association:

    summary(step)

    Call:
    lm(formula = Infant.Mortality ~ Fertility + Education, data = swiss)

    Residuals:
        Min      1Q  Median      3Q     Max
    -7.6927 -1.4049  0.2218  1.7751  6.1685

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8.63758    3.33524   2.590 0.012973
    Fertility    0.14615    0.04125   3.543 0.000951
    Education    0.09595    0.05359   1.790 0.080273

    Residual standard error: 2.614 on 44 degrees of freedom
    Multiple R-squared:  0.2296,    Adjusted R-squared:  0.1946
    F-statistic: 6.558 on 2 and 44 DF,  p-value: 0.003215

What do I conclude? Is Education an important predictor or not?

Also, do the graphs using plot(fit) add any significant information?

Thanks for your help.


Edit:
I ran the Shapiro-Wilk test on all columns and found that two are not normally distributed:

    Fertility : P= 0.3449466 (Normally distributed)
    Agriculture : P= 0.1930223 (Normally distributed)
    Examination : P= 0.2562701 (Normally distributed)
    Education : P= 1.31202e-07 (--- NOT Normally distributed! ---)
    Catholic : P= 1.20461e-07 (--- NOT Normally distributed! ---)
    Infant.Mortality : P= 0.4978056 (Normally distributed)

Does that make a difference?
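For reference, the per-column check described in the edit could be reproduced roughly as follows. This is only a sketch using the built-in `swiss` data; the names `p_values` and `fit_all` are illustrative and not from the original post. Because the usual normality assumption in linear regression concerns the errors rather than the individual columns, the same test is also applied to the residuals of the full model for comparison.

```r
# Shapiro-Wilk p-value for every column of the built-in swiss data
p_values <- sapply(swiss, function(x) shapiro.test(x)$p.value)
print(signif(p_values, 4))

# Same test on the residuals of the full model (fit_all is an
# illustrative name; it matches the first lm() call in the question)
fit_all <- lm(Infant.Mortality ~ ., data = swiss)
res_test <- shapiro.test(residuals(fit_all))
print(res_test)
```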

Stepwise is generally frowned upon – it's been discussed many times here.

However, if you simply compare the two outputs, they answer different questions, so they give different answers. Fertility is significant in both. Education is borderline significant (p ≈ 0.08) when only Fertility and Education are included, and not close to significant (p ≈ 0.48) when the other variables are included.
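One way to see this side by side is to fit both models explicitly and compare them. The following is a minimal sketch on the built-in `swiss` data; `fit_full` and `fit_small` are illustrative names, and the nested F-test via `anova()` is offered as one possible comparison, not as the method used in the question.

```r
# Full model: Infant.Mortality regressed on all other columns of swiss
fit_full <- lm(Infant.Mortality ~ ., data = swiss)

# Reduced model: the one the stepwise search ended on in the question
fit_small <- lm(Infant.Mortality ~ Fertility + Education, data = swiss)

# Each Education p-value is conditional on the other terms in its model
p_full  <- coef(summary(fit_full))["Education", "Pr(>|t|)"]
p_small <- coef(summary(fit_small))["Education", "Pr(>|t|)"]
print(c(full = p_full, reduced = p_small))

# Nested F-test: do Agriculture, Examination and Catholic jointly
# improve on the reduced model?
cmp <- anova(fit_small, fit_full)
print(cmp)
```

Note that a p-value from a model chosen by a stepwise search does not account for the search itself, which is one reason such procedures are criticized.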

My inclination, barring other information, is that you probably included all these IVs for good reason and I would therefore go with the first model (with all the IVs). However, I'd look for collinearity issues too.
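A quick collinearity check could look like the sketch below, which computes variance inflation factors (VIFs) in base R from the inverse of the predictors' correlation matrix. The names `X`, `vif` and `r2_edu` are illustrative. Any cutoff is a rule of thumb only; values well above roughly 5 to 10 are often treated as a warning sign.

```r
# Predictor columns of the full model (everything except the response)
X <- swiss[, setdiff(names(swiss), "Infant.Mortality")]

# VIFs: the diagonal of the inverse correlation matrix of the predictors
# equals 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the remaining predictors
vif <- diag(solve(cor(X)))
names(vif) <- names(X)
print(round(vif, 2))

# Cross-check one value directly from that definition
r2_edu <- summary(lm(Education ~ ., data = X))$r.squared
print(1 / (1 - r2_edu))
```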
