Suppose we have a problem with four variables (y, x1, x2 and x3) and we want to fit a multiple linear regression model. To find out which variables matter most, we run a stepwise selection such as the following (forward is just an example; we could also have used backward or both):
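g0 = lm(Y ~ 1, data = dat)   # intercept-only starting model
gxf = formula(gx)            # gx is not defined in the post; presumably the fitted full model with all three predictors
forward = step(g0, scope = gxf, direction = "forward", test = "F")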
Suppose this procedure tells us that our model should be y ~ x1 + x3. If we then call summary() on the object "forward", we get this:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.071923   0.150266   0.479    0.636
X1           0.009716   0.001890   5.140 2.09e-05 ***
X3          -0.013497   0.009230  -1.462    0.155
Should we change our model to y ~ x1? Why is x3 not significant? And if we do switch to y ~ x1 alone, but then fit lm(y ~ x3) and its summary shows that x3 is significant too, which model is better? The one with the higher R^2?
Stepwise selection is not good practice for choosing variables in linear regression: the final model is the survivor of many implicit comparisons, so its standard errors and p-values are biased toward zero.
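A small simulation can make that bias concrete. The sketch below is only an illustration under made-up settings (n, p, n_sims and the pure-noise data are assumptions, not taken from the question): y is generated independently of every predictor, yet forward selection often returns a model in which some term looks significant at the 5% level.

```r
set.seed(1)
n <- 50        # observations per simulated data set (assumed)
p <- 20        # number of pure-noise candidate predictors (assumed)
n_sims <- 200  # number of simulated data sets (assumed)

min_p <- rep(NA_real_, n_sims)
for (i in seq_len(n_sims)) {
  dat <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
  names(dat) <- paste0("x", seq_len(p))
  dat$y <- rnorm(n)  # y is unrelated to every predictor

  null_fit <- lm(y ~ 1, data = dat)
  upper <- reformulate(paste0("x", seq_len(p)), response = "y")
  sel <- step(null_fit, scope = upper, direction = "forward", trace = 0)

  # p-values of the selected terms, excluding the intercept
  pvals <- summary(sel)$coefficients[-1, "Pr(>|t|)"]
  if (length(pvals) > 0) min_p[i] <- min(pvals)
}

# Share of simulations whose selected model contains a term with p < 0.05
frac_sig <- mean(!is.na(min_p) & min_p < 0.05)
frac_sig
```

If the reported p-values were honest, a nominal 5% test on noise would flag a term only about 5% of the time; the share printed here should come out far higher, because the p-values are computed as if the model had been chosen in advance.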
I'm guessing you began with these three variables because they were all of substantive interest? If so, with only three variables, use them all. You are not likely to have collinearity issues with only three predictors unless two of them are very similar, in which case you should consider combining them somehow. Remember that a non-significant result for a variable you expected to be a significant predictor of Y can be interesting too!
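As a rough sketch of that advice, one could fit the full model directly and look at the pairwise correlations among the predictors. The data frame below is simulated purely for illustration; its column names follow the question, but the values and coefficients are assumptions.

```r
set.seed(123)
n <- 100
# Hypothetical stand-in for the poster's data frame `dat`
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
dat$Y <- 0.5 * dat$X1 - 0.2 * dat$X3 + rnorm(n)

# Fit the model with all three predictors, with no selection step
full_fit <- lm(Y ~ X1 + X2 + X3, data = dat)
summary(full_fit)

# Pairwise correlations among the predictors; values near +1 or -1
# would suggest two predictors are measuring nearly the same thing
cor_mat <- cor(dat[, c("X1", "X2", "X3")])
round(cor_mat, 2)
```

Pairwise correlations are only a first check, but with three predictors they are usually enough to spot two variables that overlap heavily.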
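Finally, on your question about X3's lack of significance and picking models by R^2: the step() procedure in R picks the combination with the best AIC score (or BIC if you change k to log(n)). The test="F" argument only adds F-test columns to the printed tables; it does not change which terms are selected. Thus step() isn't concerned with p-values or R^2, only AIC. Those criteria are related, of course, but they need not agree: adding a single term lowers AIC whenever its p-value is below roughly 0.16, which is how X3 can be selected with p = 0.155.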
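To see the selection criterion at work, here is a minimal sketch on simulated data (the data frame, the coefficients and the object names fwd_aic, fwd_aic_F and fwd_bic are all assumptions made for illustration). It runs forward selection with the default AIC penalty, with test = "F" added, and with the BIC penalty k = log(n).

```r
set.seed(42)
n <- 100
# Hypothetical data: X1 has a clear effect, X3 a weak one, X2 none
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
dat$Y <- 0.5 * dat$X1 + 0.15 * dat$X3 + rnorm(n)

g0 <- lm(Y ~ 1, data = dat)   # intercept-only starting model
upper <- Y ~ X1 + X2 + X3     # largest model step() may reach

# Default penalty k = 2 (AIC)
fwd_aic <- step(g0, scope = upper, direction = "forward", trace = 0)
# Same search, with F tests requested for the printed tables
fwd_aic_F <- step(g0, scope = upper, direction = "forward",
                  test = "F", trace = 0)
# Penalty k = log(n) (BIC)
fwd_bic <- step(g0, scope = upper, direction = "forward",
                k = log(n), trace = 0)

names(coef(fwd_aic))
names(coef(fwd_bic))

# test = "F" does not alter which terms are chosen
same_terms <- identical(names(coef(fwd_aic)), names(coef(fwd_aic_F)))
same_terms
```

Because the BIC penalty log(n) exceeds 2 once n is above 7, the BIC search here stops no later than the AIC search, so it can only return the same model or a smaller one.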