This is a basic question, but I could not find a clear answer in my reading. I am trying to find independent predictors of Infant.Mortality in the data frame 'swiss' in R.
    > head(swiss)
                 Fertility Agriculture Examination Education Catholic Infant.Mortality
    Courtelary        80.2        17.0          15        12     9.96             22.2
    Delemont          83.1        45.1           6         9    84.84             22.2
    Franches-Mnt      92.5        39.7           5         5    93.40             20.2
    Moutier           85.8        36.5          12         7    33.77             20.3
    Neuveville        76.9        43.5          17        15     5.16             20.6
    Porrentruy        76.1        35.3           9         7    90.57             26.6
The following are the results using lm, where I find only Fertility to be a significant predictor:
    > fit = lm(Infant.Mortality~., data=swiss)
    > summary(fit)

    Call:
    lm(formula = Infant.Mortality ~ ., data = swiss)

    Residuals:
        Min      1Q  Median      3Q     Max
    -8.2512 -1.2860  0.1821  1.6914  6.0937

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8.667e+00  5.435e+00   1.595  0.11850
    Fertility    1.510e-01  5.351e-02   2.822  0.00734   # <<<< NOTE P VALUE HERE
    Agriculture -1.175e-02  2.812e-02  -0.418  0.67827
    Examination  3.695e-02  9.607e-02   0.385  0.70250
    Education    6.099e-02  8.484e-02   0.719  0.47631
    Catholic     6.711e-05  1.454e-02   0.005  0.99634

    Residual standard error: 2.683 on 41 degrees of freedom
    Multiple R-squared:  0.2439,    Adjusted R-squared:  0.1517
    F-statistic: 2.645 on 5 and 41 DF,  p-value: 0.03665
The following are the diagnostic graphs:
plot(fit)
On performing stepwise regression, I get the following results:
    > library(MASS)   # stepAIC() is provided by the MASS package
    > step <- stepAIC(fit, direction="both");
    Start:  AIC=98.34
    Infant.Mortality ~ Fertility + Agriculture + Examination + Education + Catholic

                  Df Sum of Sq    RSS     AIC
    - Catholic     1     0.000 295.07  96.341
    - Examination  1     1.065 296.13  96.511
    - Agriculture  1     1.256 296.32  96.541
    - Education    1     3.719 298.79  96.930
    <none>                     295.07  98.341
    - Fertility    1    57.295 352.36 104.682

    Step:  AIC=96.34
    Infant.Mortality ~ Fertility + Agriculture + Examination + Education

                  Df Sum of Sq    RSS     AIC
    - Examination  1     1.320 296.39  94.551
    - Agriculture  1     1.395 296.46  94.563
    - Education    1     5.774 300.84  95.252
    <none>                     295.07  96.341
    + Catholic     1     0.000 295.07  98.341
    - Fertility    1    72.609 367.68 104.681

    Step:  AIC=94.55
    Infant.Mortality ~ Fertility + Agriculture + Education

                  Df Sum of Sq    RSS     AIC
    - Agriculture  1     4.250 300.64  93.220
    - Education    1     6.875 303.26  93.629
    <none>                     296.39  94.551
    + Examination  1     1.320 295.07  96.341
    + Catholic     1     0.255 296.13  96.511
    - Fertility    1    79.804 376.19 103.758

    Step:  AIC=93.22
    Infant.Mortality ~ Fertility + Education

                  Df Sum of Sq    RSS     AIC
    <none>                     300.64  93.220
    - Education    1    21.902 322.54  94.525
    + Agriculture  1     4.250 296.39  94.551
    + Examination  1     4.175 296.46  94.563
    + Catholic     1     2.318 298.32  94.857
    - Fertility    1    85.769 386.41 103.017

    > step$anova
    Stepwise Model Path
    Analysis of Deviance Table

    Initial Model:
    Infant.Mortality ~ Fertility + Agriculture + Examination + Education + Catholic

    Final Model:
    Infant.Mortality ~ Fertility + Education

               Step Df     Deviance Resid. Df Resid. Dev      AIC
    1                                      41   295.0662 98.34145
    2    - Catholic  1 0.0001533995        42   295.0663 96.34147
    3 - Examination  1 1.3199421028        43   296.3863 94.55125
    4 - Agriculture  1 4.2499886025        44   300.6363 93.22041
The summary of the final model shows that Education also has a trend towards a significant association:
    > summary(step)

    Call:
    lm(formula = Infant.Mortality ~ Fertility + Education, data = swiss)

    Residuals:
        Min      1Q  Median      3Q     Max
    -7.6927 -1.4049  0.2218  1.7751  6.1685

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  8.63758    3.33524   2.590 0.012973
    Fertility    0.14615    0.04125   3.543 0.000951
    Education    0.09595    0.05359   1.790 0.080273

    Residual standard error: 2.614 on 44 degrees of freedom
    Multiple R-squared:  0.2296,    Adjusted R-squared:  0.1946
    F-statistic: 6.558 on 2 and 44 DF,  p-value: 0.003215
What do I conclude? Is Education an important predictor or not?
Also, do the graphs using plot(fit) add any significant information?
Thanks for your help.
Edit:
I ran the Shapiro-Wilk test on all columns and found that two are not normally distributed:
    Fertility        : P= 0.3449466    (Normally distributed)
    Agriculture      : P= 0.1930223    (Normally distributed)
    Examination      : P= 0.2562701    (Normally distributed)
    Education        : P= 1.31202e-07  (--- NOT Normally distributed! ---)
    Catholic         : P= 1.20461e-07  (--- NOT Normally distributed! ---)
    Infant.Mortality : P= 0.4978056    (Normally distributed)
Does that make a difference?
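For anyone wanting to reproduce these checks, here is a minimal sketch. It assumes the p-values came from `shapiro.test` applied column by column to the built-in `swiss` data; the names `shapiro_p` and `resid_p` are made up for illustration. The usual normality assumption in linear regression concerns the residuals rather than the individual columns, so the sketch also runs the same test on the residuals of the full model.

```r
# Shapiro-Wilk p-value for every column of the built-in swiss data
# (assumed to be how the per-column values were obtained).
shapiro_p <- sapply(swiss, function(x) shapiro.test(x)$p.value)
signif(shapiro_p, 4)

# The same test on the residuals of the full model, since the regression
# assumption is about the error term, not about each predictor.
fit <- lm(Infant.Mortality ~ ., data = swiss)
resid_p <- shapiro.test(residuals(fit))$p.value
resid_p
```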
Best Answer
Stepwise selection is generally frowned upon; it has been discussed many times here.
However, if you simply compare the two outputs, they answer different questions, so they give different answers. Fertility is significant in both models. Education is borderline significant (p ≈ 0.08) when only Fertility and Education are included, and not close to significant (p ≈ 0.48) when the other variables are also included.
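To see the two answers side by side, something like the following sketch could be used. It refits both models from the question on the built-in `swiss` data; the object names `full`, `reduced`, `edu_full` and `edu_reduced` are illustrative only. The final call is a partial F-test of whether Agriculture, Examination and Catholic jointly add anything beyond Fertility and Education, which is one possible way to compare the nested models, not something the original outputs include.

```r
# Fit both models from the question on the built-in swiss data.
full    <- lm(Infant.Mortality ~ ., data = swiss)
reduced <- lm(Infant.Mortality ~ Fertility + Education, data = swiss)

# Education's coefficient row (estimate, SE, t, p) under each model.
edu_full    <- coef(summary(full))["Education", ]
edu_reduced <- coef(summary(reduced))["Education", ]
rbind(full = edu_full, reduced = edu_reduced)

# Partial F-test for the three predictors dropped by the stepwise search.
anova(reduced, full)
```

The Education estimate changes between the two fits because each coefficient is adjusted for whatever else is in the model.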
My inclination, barring other information, is that you probably included all these IVs for good reason and I would therefore go with the first model (with all the IVs). However, I'd look for collinearity issues too.
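One way to check for collinearity is to compute variance inflation factors. The sketch below does this by hand in base R so that no extra package is assumed; `vif_by_hand` is a made-up name, and packaged implementations exist that may be more convenient. For each predictor, it regresses that predictor on the others and reports 1 / (1 - R²). Commonly cited rules of thumb treat values above roughly 5 or 10 as a warning sign, but those cutoffs are conventions, not hard limits.

```r
# Predictors of the full model (everything except the response).
predictors <- setdiff(names(swiss), "Infant.Mortality")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors.
vif_by_hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = swiss)
  1 / (1 - summary(aux)$r.squared)
})
round(vif_by_hand, 2)

# Pairwise correlations among the predictors give a complementary view.
round(cor(swiss[predictors]), 2)
```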