In the Python statsmodels documentation there is an example with the goal:
We want to know whether literacy rates (Literacy column) in the 85 French departments (Departments) are associated with per capita wagers on the Royal Lottery (Lottery) in the 1820s. We need to control for the level of wealth (Wealth) in each department, and we also want to include a series of dummy variables on the right-hand side of our regression equation to control for unobserved heterogeneity due to regional effects (Region; dummies for N, E, S, W coded 0 or 1). The model is estimated using ordinary least squares regression (OLS).
                            OLS Regression Results
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.338
Model:                            OLS   Adj. R-squared:                  0.287
Method:                 Least Squares   F-statistic:                     6.636
Date:                Tue, 02 Feb 2021   Prob (F-statistic):           1.07e-05
Time:                        07:07:06   Log-Likelihood:                -375.30
No. Observations:                  85   AIC:                             764.6
Df Residuals:                      78   BIC:                             781.7
Df Model:                           6
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      38.6517      9.456      4.087      0.000      19.826      57.478
Region[T.E]   -15.4278      9.727     -1.586      0.117     -34.793       3.938
Region[T.N]   -10.0170      9.260     -1.082      0.283     -28.453       8.419
Region[T.S]    -4.5483      7.279     -0.625      0.534     -19.039       9.943
Region[T.W]   -10.0913      7.196     -1.402      0.165     -24.418       4.235
Literacy       -0.1858      0.210     -0.886      0.378      -0.603       0.232
Wealth          0.4515      0.103      4.390      0.000       0.247       0.656
==============================================================================
Omnibus:                        3.049   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
Skew:                          -0.340   Prob(JB):                        0.260
Kurtosis:                       2.454   Cond. No.                         371.
==============================================================================
Prob (F-statistic) is 1.07e-05, so the null hypothesis (H0: all slope coefficients, i.e. every coefficient except the intercept, are equal to zero) is rejected: there is statistically significant evidence of a relationship between the dependent variable and the independent variables taken together. But only Wealth has a p-value < 0.05.
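As a quick sanity check on what the overall F-test measures, the F-statistic and adjusted R-squared can be recomputed by hand from the R-squared and the degrees of freedom in the summary. This is only a sketch using the rounded values printed above, with the standard textbook formulas; the variable names are my own.

```python
# Values copied from the summary output (R-squared is rounded to 3 decimals there).
r_squared = 0.338
n_obs = 85
df_model = 6                      # slope terms: 4 region dummies, Literacy, Wealth
df_resid = n_obs - df_model - 1   # 78, matching "Df Residuals"

# Overall F-test: explained variation per model df over unexplained variation per residual df.
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Adjusted R-squared penalises R-squared for the number of fitted slope terms.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"F = {f_stat:.2f}, adjusted R^2 = {adj_r_squared:.3f}")
```

The result agrees with the reported F-statistic of 6.636 and adjusted R-squared of 0.287 up to the rounding of R-squared. Note that this test is about all six slope terms jointly; it says nothing about which individual term drives the result.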
Should the model be used as is? Or should all independent variables except Wealth be removed? What should be done based on the goal "We want to know whether literacy … We need to control for the level of wealth (Wealth) in each department …"?
Best Answer
Assuming that there are no problems with model assumptions, the model should be used as it is. Insignificant variables should not be removed. Removing them would invalidate any tests that are run within the reduced models, because the p-values and confidence intervals computed there do not account for the fact that the variables were selected using the same data. (Removing insignificant variables seems to be a common practice, but that doesn't make it better. Occasionally there are reasons for removal, such as variables that are potentially expensive to observe in the future when the model is used for prediction, or a number of observations too small to fit the full model with reasonable reliability, but I don't see such reasons here; even in such cases there are often better criteria than significance.)
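The point that tests within a reduced model become invalid can be illustrated with a small simulation. The sketch below is my own illustration, not part of the statsmodels example: it uses synthetic data in which the response is unrelated to every predictor, plain NumPy instead of statsmodels, and a hard-coded critical value of 1.99 as an approximation to the two-sided 5% t quantile for roughly 80 degrees of freedom. All function and variable names are hypothetical.

```python
import numpy as np

def t_stats(X, y):
    """OLS t statistics for each column of X (X must already contain the intercept column)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)            # unbiased residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # estimated covariance of beta
    return beta / np.sqrt(np.diag(cov))

def simulate(n_sims=2000, n=85, k=6, crit=1.99, seed=0):
    """Compare rejection rates before and after dropping 'insignificant' predictors.

    crit=1.99 is an assumed approximation to the 5% two-sided t critical value.
    """
    rng = np.random.default_rng(seed)
    full_hits = 0    # rejections for a pre-specified predictor in the full model
    kept = 0         # predictors that survive the significance filter
    kept_hits = 0    # of those, how many are "significant" again after refitting
    for _ in range(n_sims):
        X = rng.standard_normal((n, k))
        y = rng.standard_normal(n)              # the null is true for every predictor
        ones = np.ones((n, 1))
        t_full = t_stats(np.hstack([ones, X]), y)[1:]   # drop the intercept's t
        full_hits += abs(t_full[0]) > crit
        keep = np.abs(t_full) > crit
        if keep.any():
            # Refit using only the predictors that looked significant.
            t_red = t_stats(np.hstack([ones, X[:, keep]]), y)[1:]
            kept += keep.sum()
            kept_hits += (np.abs(t_red) > crit).sum()
    return full_hits / n_sims, kept_hits / max(kept, 1)

full_rate, selected_rate = simulate()
print(f"rejection rate, pre-specified predictor in full model: {full_rate:.3f}")
print(f"rejection rate, retained predictors in reduced model:  {selected_rate:.3f}")
```

In the full model the test for a predictor chosen in advance should reject at close to the nominal 5% rate. Among the predictors that survive the filter, the refitted model declares most of them significant even though none has any real effect, which is why p-values from a model pruned by significance cannot be read at face value.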