# Solved – Cox PH model validation on a large dataset

I want to calculate a Vaccine Effectiveness (Influenza) using Hazard Ratios from a Cox PH model. Methods are similar to this article : https://www.ncbi.nlm.nih.gov/pubmed/27813473

I am not so good in mathematics, I have a medical background and I am a R user with statistics basis.

I am using real world data (drug sales). The vaccination variable = vaccine sale 14 days before event
The event is a basket sale combining drugs like paracetamol + anti-cough (unspecific but highly correlated to flu epidemic).

Here is the ouput :

``> mod1 <- coxph( >         Surv(db3\$time, db3\$panier) ~  sex + vaccination + age_class, >         data = db3 >         ) >         summary(mod1)    Call:     coxph(formula = Surv(db3\$time, db3\$panier) ~ sex + vaccination +      age_class, data = db3)    n= 2437352, number of events= 211732                          coef exp(coef)  se(coef)       z Pr(>|z|)     sex1              -0.132444  0.875952  0.004624  -28.64   <2e-16 *** vaccination       -0.405701  0.666509  0.006545  -61.99   <2e-16 *** age_class(10,20]  -0.407904  0.665043  0.014687  -27.77   <2e-16 *** age_class(20,30]  -0.903083  0.405318  0.012527  -72.09   <2e-16 *** age_class(30,40]  -0.948397  0.387362  0.011552  -82.10   <2e-16 *** age_class(40,50]  -1.045602  0.351480  0.011140  -93.86   <2e-16 *** age_class(50,60]  -1.134455  0.321597  0.010823 -104.82   <2e-16 *** age_class(60,70]  -1.303770  0.271506  0.010889 -119.73   <2e-16 *** age_class(70,80]  -1.415494  0.242806  0.011778 -120.18   <2e-16 *** age_class(80,90]  -1.615479  0.198796  0.013093 -123.39   <2e-16 *** age_class(90,100] -2.024649  0.132040  0.023573  -85.89   <2e-16 *** --- Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1                    exp(coef) exp(-coef) lower .95 upper .95 sex1                 0.8760      1.142    0.8680    0.8839 vaccination          0.6665      1.500    0.6580    0.6751 age_class(10,20]     0.6650      1.504    0.6462    0.6845 age_class(20,30]     0.4053      2.467    0.3955    0.4154 age_class(30,40]     0.3874      2.582    0.3787    0.3962 age_class(40,50]     0.3515      2.845    0.3439    0.3592 age_class(50,60]     0.3216      3.109    0.3148    0.3285 age_class(60,70]     0.2715      3.683    0.2658    0.2774 age_class(70,80]     0.2428      4.119    0.2373    0.2485 age_class(80,90]     0.1988      5.030    0.1938    0.2040 age_class(90,100]    0.1320      7.573    0.1261    0.1383  Concordance= 0.619  (se = 0.001 ) Rsquare= 0.016   (max possible= 0.92 ) Likelihood ratio test= 40343  on 11 df,   p=0 Wald test            = 42384  on 11 df,   p=0 Score (logrank) test = 47349  on 11 df,   p=0 ``

Validation

``> validation <- cox.zph(mod1) > validation                         rho    chisq        p sex1              -0.028494 1.72e+02 0.00e+00 vaccination        0.210335 9.30e+03 0.00e+00 age_class(10,20]   0.009616 1.96e+01 9.65e-06 age_class(20,30]  -0.005949 7.50e+00 6.18e-03 age_class(30,40]  -0.004818 4.92e+00 2.66e-02 age_class(40,50]  -0.000899 1.71e-01 6.79e-01 age_class(50,60]   0.007783 1.28e+01 3.45e-04 age_class(60,70]   0.001600 5.40e-01 4.62e-01 age_class(70,80]  -0.013171 3.66e+01 1.47e-09 age_class(80,90]  -0.021582 9.90e+01 0.00e+00 age_class(90,100] -0.018040 6.92e+01 1.11e-16 GLOBAL                   NA 1.05e+04 0.00e+00 ``
Contents

## My question

How can I validate this model ? I have read that p-values can be messy when using large datasets… The `cox.zph` method with variables > 0.05 is not an option here.
I tested the model with the variable sex, I cant validate the model either.

I want to know if I can use the Hazard ratio despite the fact I cant validate the model.

Or maybe I have to use graphical methods ?

This is a well known problem in all of literature. You simply cannot calibrate global tests on such large data. The truth is that the proportional hazards assumption is probably violated. In fact, I am certain that such an assumption never holds in any circumstance ever: we have just lacked adequate data to address it. Ask yourself why it even matters: what you want is valid inference i.e. do the 95% CIs actually reflect the uncertainty in the design and random outcomes that would be encountered if we replicated the study?

To address this, you can use robust standard errors. In the `coxph` function, this is achieved by using `robust=TRUE`. Some foundational work by Lagakos and Wei at Harvard showed that, when you use these robust standard errors, even when the proportional hazards assumption is violated, the Cox model coefficients converge to a useful measure of association that can be used for inference and tests.

With the robust error, the interpretation of the (exponentiated) model coefficient is the failure time averaged hazard ratio. So if the HR generally bounces between 1.2 and 1.5, your estimate may fall in that range (say 1.35) and you can collect compelling evidence that this varying ratio is, on average, higher than 1.

To further validate these findings, consider using graphics like plots of the Schoenfeld residuals and the Kaplan Meier curve to see what the shape of the trend is. If in fact the hazard ratio varies from 0.5 to 2 the average could be 1 (no association) but the reality is that the trend is harmful then beneficial and that is an interesting finding.

Lagakos SW, Schoenfeld DA. "Properties of proportional-hazards score tests under misspecified regression models." Biometrics. 1984 Dec;40(4):1037-48.

D. Y. Lin & L. J. Wei "The Robust Inference for the Cox Proportional Hazards Model" Journal of the American Statistical Association  Volume 84, 1989 – Issue 408

Rate this post