I've been reading a lot these last few days about how to get a p-value from the bootstrap for regression models (not by permutation). For each coefficient of the model, the null hypothesis is that the coefficient equals 0 and H1 is that it is different from 0 (two-sided test).
The most relevant threads are the following two questions on Stack Exchange, but their answers confuse me a lot:
- How to obtain p-values of coefficients from bootstrap regression?
- Computing p-value using bootstrap with R
From what I've read, I've come across several approaches, but I can't figure out which one is valid:
- Should my bootstrap function return the test statistic computed for each sample, or the coefficient estimate?
- Should I calculate the proportion of test statistics/estimates beyond 0, or beyond the point estimate of the original model?
- Should I multiply the result by 2 because the test is two-sided, or use absolute values?
Best Answer
Suppose we want to test the null hypothesis that a regression coefficient = 0 using the bootstrap, and say we set the significance level at 0.05. We can generate the sampling distribution of each coefficient using the bootstrap. It is then easy to check whether 0 falls within the 95% confidence interval, so we can easily decide whether or not to reject the null.
To get a p-value, we need to find the quantile at which 0 falls in that sampling distribution. (I am using a quantile-based approach; other methods can be found here: Fox on Regression.) Once you have that quantile Q, the p-value is 2*Q if Q < 0.5, and 2*(1-Q) if Q > 0.5. For example, if only 0.2% of the bootstrap estimates fall at or below 0, then Q = 0.002 and the two-sided p-value is 0.004.
As an illustration of the approach, consider the divusa dataset from the faraway package.
library(faraway)
Build linear model
mdl <- lm(divorce ~ ., data = divusa)
Bootstrap
bootTest <- sapply(1:1e4, function(x) {
  rows <- sample(nrow(divusa), nrow(divusa), replace = TRUE)  # resample rows with replacement
  mdl <- lm(divorce ~ ., data = divusa[rows, ])               # refit the model on the resample
  return(mdl$coefficients)
})
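For reference, the same resampling could also be written with the boot package; this is my own sketch, not part of the original answer, and the name coefFun is mine:

library(boot)

# Statistic for boot(): refit the model on the resampled rows, return its coefficients
coefFun <- function(data, idx) coef(lm(divorce ~ ., data = data[idx, ]))

set.seed(123)
bootObj <- boot(divusa, coefFun, R = 1e4)
# bootObj$t is a 1e4 x 7 matrix of bootstrap coefficients;
# t(bootObj$t) has the same layout as bootTest above (one row per coefficient)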
Here is the model
summary(mdl)

Call:
lm(formula = divorce ~ ., data = divusa)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9087 -0.9212 -0.0935  0.7447  3.4689

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 380.14761   99.20371   3.832 0.000274 ***
year         -0.20312    0.05333  -3.809 0.000297 ***
unemployed   -0.04933    0.05378  -0.917 0.362171
femlab        0.80793    0.11487   7.033 1.09e-09 ***
marriage      0.14977    0.02382   6.287 2.42e-08 ***
birth        -0.11695    0.01470  -7.957 2.19e-11 ***
military     -0.04276    0.01372  -3.117 0.002652 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.513 on 70 degrees of freedom
Multiple R-squared:  0.9344,    Adjusted R-squared:  0.9288
F-statistic: 166.2 on 6 and 70 DF,  p-value: < 2.2e-16
Notice the p-values
Subset of the regression coefficients generated by the bootstrap
bootTest[,1:5]
                    [,1]          [,2]         [,3]         [,4]         [,5]
(Intercept) 335.70970574 372.525260160 569.85830341 338.70069977 344.69261238
year         -0.18107568  -0.201798080  -0.30579380  -0.18125215  -0.18328105
unemployed   -0.02916575   0.006828023   0.01197723  -0.05610887  -0.11463230
femlab        0.79078784   0.842924808   1.02607863   0.77527548   0.76472406
marriage      0.17372382   0.199571033   0.18782967   0.15289119   0.15693996
birth        -0.11613752  -0.118507758  -0.11998122  -0.11666450  -0.13344442
military     -0.04051730  -0.056277118  -0.04062756  -0.05167556  -0.07251748
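As a quick sanity check on the confidence-interval criterion mentioned at the start, one can also compute percentile intervals from these bootstrap coefficients; this is my own addition, not part of the original answer:

# 95% percentile interval for each coefficient (one row of bootTest per coefficient)
ci <- apply(bootTest, 1, quantile, probs = c(0.025, 0.975))
# Reject H0 (coefficient = 0) whenever 0 falls outside the interval
rejectNull <- ci["2.5%", ] > 0 | ci["97.5%", ] < 0
rejectNull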
Generate p-values with bootstrap
pvals <- sapply(1:nrow(bootTest), function(x) {
  distribution <- ecdf(bootTest[x, ])  # empirical CDF of the bootstrap estimates
  qt0 <- distribution(0)               # quantile of 0 in that distribution
  if (qt0 < 0.5) {
    return(2 * qt0)
  } else {
    return(2 * (1 - qt0))
  }
})
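A small usage note (my addition): pvals comes back unnamed, in the same order as the rows of bootTest, so naming it makes the comparison below easier to read:

names(pvals) <- rownames(bootTest)
round(pvals, 4)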
Comparing p-values from t-test and bootstrap
t-test
summary(mdl)$coefficients[,4]
 (Intercept)         year   unemployed       femlab     marriage        birth     military
2.744830e-04 2.966776e-04 3.621708e-01 1.085196e-09 2.419284e-08 2.191964e-11 2.652003e-03
Bootstrap
> pvals
[1] 0.0008 0.0008 0.2196 0.0000 0.0000 0.0000 0.0188
The coefficients whose t-test p-values are below 1e-8 all come out as exactly 0 in the bootstrap with 1e4 resamples. That is a resolution effect: with the quantile approach, the smallest nonzero p-value attainable from 1e4 resamples is 2/1e4 = 2e-4, so a reported 0 should be read as "smaller than the bootstrap can resolve". Furthermore, the ranking of the p-values is comparable as well.
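If exact zeros are undesirable, a common convention (see, e.g., Davison & Hinkley) is to report (r + 1)/(B + 1) instead of r/B, where r is the tail count and B the number of resamples. A sketch of that adjustment, reusing bootTest, could look like this; it is my own addition, not part of the original answer:

B <- ncol(bootTest)
pvalsAdj <- sapply(1:nrow(bootTest), function(x) {
  r <- min(sum(bootTest[x, ] <= 0), sum(bootTest[x, ] >= 0))  # count in the smaller tail
  min(1, 2 * (r + 1) / (B + 1))                               # two-sided, never exactly 0
})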