I want to calculate a Vaccine Effectiveness (Influenza) using Hazard Ratios from a Cox PH model. Methods are similar to this article : https://www.ncbi.nlm.nih.gov/pubmed/27813473
I am not so good in mathematics, I have a medical background and I am a R user with statistics basis.
I am using real world data (drug sales). The vaccination variable = vaccine sale 14 days before event
The event is a basket sale combining drugs like paracetamol + anti-cough (unspecific but highly correlated to flu epidemic).
Here is the ouput :
> mod1 <- coxph( > Surv(db3$time, db3$panier) ~ sex + vaccination + age_class, > data = db3 > ) > summary(mod1) Call: coxph(formula = Surv(db3$time, db3$panier) ~ sex + vaccination + age_class, data = db3) n= 2437352, number of events= 211732 coef exp(coef) se(coef) z Pr(>|z|) sex1 -0.132444 0.875952 0.004624 -28.64 <2e-16 *** vaccination -0.405701 0.666509 0.006545 -61.99 <2e-16 *** age_class(10,20] -0.407904 0.665043 0.014687 -27.77 <2e-16 *** age_class(20,30] -0.903083 0.405318 0.012527 -72.09 <2e-16 *** age_class(30,40] -0.948397 0.387362 0.011552 -82.10 <2e-16 *** age_class(40,50] -1.045602 0.351480 0.011140 -93.86 <2e-16 *** age_class(50,60] -1.134455 0.321597 0.010823 -104.82 <2e-16 *** age_class(60,70] -1.303770 0.271506 0.010889 -119.73 <2e-16 *** age_class(70,80] -1.415494 0.242806 0.011778 -120.18 <2e-16 *** age_class(80,90] -1.615479 0.198796 0.013093 -123.39 <2e-16 *** age_class(90,100] -2.024649 0.132040 0.023573 -85.89 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 exp(coef) exp(-coef) lower .95 upper .95 sex1 0.8760 1.142 0.8680 0.8839 vaccination 0.6665 1.500 0.6580 0.6751 age_class(10,20] 0.6650 1.504 0.6462 0.6845 age_class(20,30] 0.4053 2.467 0.3955 0.4154 age_class(30,40] 0.3874 2.582 0.3787 0.3962 age_class(40,50] 0.3515 2.845 0.3439 0.3592 age_class(50,60] 0.3216 3.109 0.3148 0.3285 age_class(60,70] 0.2715 3.683 0.2658 0.2774 age_class(70,80] 0.2428 4.119 0.2373 0.2485 age_class(80,90] 0.1988 5.030 0.1938 0.2040 age_class(90,100] 0.1320 7.573 0.1261 0.1383 Concordance= 0.619 (se = 0.001 ) Rsquare= 0.016 (max possible= 0.92 ) Likelihood ratio test= 40343 on 11 df, p=0 Wald test = 42384 on 11 df, p=0 Score (logrank) test = 47349 on 11 df, p=0
Validation
> validation <- cox.zph(mod1) > validation rho chisq p sex1 -0.028494 1.72e+02 0.00e+00 vaccination 0.210335 9.30e+03 0.00e+00 age_class(10,20] 0.009616 1.96e+01 9.65e-06 age_class(20,30] -0.005949 7.50e+00 6.18e-03 age_class(30,40] -0.004818 4.92e+00 2.66e-02 age_class(40,50] -0.000899 1.71e-01 6.79e-01 age_class(50,60] 0.007783 1.28e+01 3.45e-04 age_class(60,70] 0.001600 5.40e-01 4.62e-01 age_class(70,80] -0.013171 3.66e+01 1.47e-09 age_class(80,90] -0.021582 9.90e+01 0.00e+00 age_class(90,100] -0.018040 6.92e+01 1.11e-16 GLOBAL NA 1.05e+04 0.00e+00
My question
How can I validate this model ? I have read that p-values can be messy when using large datasets… The cox.zph
method with variables > 0.05 is not an option here.
I tested the model with the variable sex, I cant validate the model either.
I want to know if I can use the Hazard ratio despite the fact I cant validate the model.
Or maybe I have to use graphical methods ?
Thank you for your help
Best Answer
This is a well known problem in all of literature. You simply cannot calibrate global tests on such large data. The truth is that the proportional hazards assumption is probably violated. In fact, I am certain that such an assumption never holds in any circumstance ever: we have just lacked adequate data to address it. Ask yourself why it even matters: what you want is valid inference i.e. do the 95% CIs actually reflect the uncertainty in the design and random outcomes that would be encountered if we replicated the study?
To address this, you can use robust standard errors. In the coxph
function, this is achieved by using robust=TRUE
. Some foundational work by Lagakos and Wei at Harvard showed that, when you use these robust standard errors, even when the proportional hazards assumption is violated, the Cox model coefficients converge to a useful measure of association that can be used for inference and tests.
With the robust error, the interpretation of the (exponentiated) model coefficient is the failure time averaged hazard ratio. So if the HR generally bounces between 1.2 and 1.5, your estimate may fall in that range (say 1.35) and you can collect compelling evidence that this varying ratio is, on average, higher than 1.
To further validate these findings, consider using graphics like plots of the Schoenfeld residuals and the Kaplan Meier curve to see what the shape of the trend is. If in fact the hazard ratio varies from 0.5 to 2 the average could be 1 (no association) but the reality is that the trend is harmful then beneficial and that is an interesting finding.
Lagakos SW, Schoenfeld DA. "Properties of proportional-hazards score tests under misspecified regression models." Biometrics. 1984 Dec;40(4):1037-48.
D. Y. Lin & L. J. Wei "The Robust Inference for the Cox Proportional Hazards Model" Journal of the American Statistical Association Volume 84, 1989 – Issue 408
Similar Posts:
- Solved – different HRs and corresponding CIs with cph and coxph
- Solved – Using strata instead of stratifying groups separately
- Solved – Hazard ratio for more than two groups
- Solved – Multivariable survival analysis: adding another variable lowers the p value
- Solved – Interpret hazard ratio that has huge value