Assume that I am interested in analyzing the following linear regression model:

$$

Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e

$$

Please explain the difference between testing the p-value for each coefficient $\beta_i$ separately and performing a **goodness-of-fit** test for the model as a whole.

In particular:

- Is it true to say that the p-value for each coefficient corresponds to the null hypothesis that this coefficient is actually zero (for example, in MATLAB's `glmfit` function)?
- Is it possible that a model resulting in a really good fit will have high p-values for all the coefficients?
- Is it possible that a model with low p-values for all the coefficients will result in a poor fit?


#### Best Answer

*Yes*, the p-values that come with standard regression output test whether the associated beta (slope coefficient) is $0$. (It is possible to get p-values for tests against other values, but you have to know how to set that up; it isn't what software does by default, and it isn't very common. A sketch of how to set it up appears after the examples below.)

*Yes*, you can have high p-values for individual coefficients with a good fit, and low p-values with a poor fit. The reason is straightforward: goodness of fit is a different question from whether the slope of the $X, Y$ relationship is $0$ in the population. Generally, when running a regression, we are trying to determine a fitted line that traces the conditional means of $Y$ at different values of $X$. (It is also possible to wonder about other aspects of a model, but that is the most basic and common feature.) Thus, a goodness-of-fit assessment asks whether the model's fitted conditional means actually match the data's conditional means. The answer to that question can be either *yes* or *no* independently of whether the best estimate of the slope is $0$.

Consider the following examples, which are coded in R. (I don't have access to MATLAB, but the code here is intended to be as close to pseudocode as I can make it.)

```r
##### high p-values, good fit
set.seed(6462)                  # this makes the example exactly reproducible
x1 = runif(100, min=-5, max=5)  # the x-variables are uniformly distributed
x2 = runif(100, min=-5, max=5)  #   between -5 and 5
e  = rnorm(100, mean=0, sd=1)   # these are the errors
y  = 0 + 0*x1 + 0*x2 + e        # the true intercept & slopes are 0
m1 = lm(y~x1+x2)
summary(m1)
# ...
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.1257881  0.0992355  -1.268    0.208  # these p-values are
# x1           0.0009124  0.0307466   0.030    0.976  #   high & non-significant
# x2          -0.0243975  0.0316458  -0.771    0.443  #
#
# Residual standard error: 0.9884 on 97 degrees of freedom
# Multiple R-squared:  0.006149, Adjusted R-squared:  -0.01434
# F-statistic: 0.3001 on 2 and 97 DF,  p-value: 0.7415  # the whole model is ns
```

```r
##### low p-values, poor fit
# the true intercept & slopes are not 0, but the relationships are curvilinear
y2 = 5 + 0.65*x1 + -0.17*x1^2 + 0.65*x2 + -0.17*x2^2 + e
m2 = lm(y2~x1+x2)
summary(m2)
# ...
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  1.42633    0.21650   6.588 2.31e-09 ***  # very low p-values
# x1           0.64189    0.06708   9.569 1.14e-15 ***
# x2           0.58869    0.06904   8.527 2.01e-13 ***
# ...
#
# Residual standard error: 2.156 on 97 degrees of freedom
# Multiple R-squared:  0.6152, Adjusted R-squared:  0.6073
# F-statistic: 77.54 on 2 and 97 DF,  p-value: < 2.2e-16
```
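As for the sketch promised above (this code is mine, not part of the original answer): if you want a p-value against a null other than $0$, you can compute it by hand from the coefficient table. A minimal sketch, using a hypothetical null value of $0.5$ for the slope on `x1` in `m1`:

```r
# hypothetical example (not from the original answer):
# test H0: beta_x1 = 0.5 instead of the default H0: beta_x1 = 0
b.null = 0.5                                   # assumed null value for the slope
est    = coef(summary(m1))["x1", "Estimate"]   # estimated slope for x1
se     = coef(summary(m1))["x1", "Std. Error"] # its standard error
t.stat = (est - b.null) / se                   # t-statistic under the shifted null
2 * pt(-abs(t.stat), df = df.residual(m1))     # two-sided p-value
```

The arithmetic is just the usual t-test with the hypothesized value shifted away from $0$.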

What the two examples show are a model that has high / non-significant p-values but a good fit for the predicted means (because the true slopes are $0$), and a model with very low / highly significant p-values but a poor fit for the predicted means (because, although the slopes within the regions spanned by the data are far from $0$, the relationships are not actually straight lines). The p-values are easy to see and understand in the output. To see the quality of the models' fits to the conditional means, I plotted the true data-generating process (in this case I have it, because the data are simulated, but in general you won't). In a more typical case, you would just see whether the predicted means do a reasonable job of tracing the observed conditional means in your dataset; here I did that by plotting LOWESS lines. (The plots only display `x1` and collapse over `x2`, but I could make analogous plots with `x2`, or various kinds of fancy plots with both `x1` and `x2`, and they would show the same thing.)
