Solved – Sample size and cross-validation methods for Cox regression predictive models

I have a question I would like to pose to the community. I have recently been asked to provide statistical analysis for a tumor marker prognostic study. I have primarily used these two references to guide my analysis:

  1. McShane LM, et al. Reporting recommendations for tumor marker prognostic studies (REMARK). J Natl Cancer Inst. 2005 Aug 17; 97(16):1180-4.

  2. Simon RM, et al. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011 May; 12(3):203-14. Epub 2011 Feb 15.

I have summarized the study and my analyses below. I would appreciate any comments, suggestions, or criticisms.

Study background:

Some patients with cancer X experience early relapse after treatment. The clinical prognostic score currently used by doctors does not do a good job of predicting clinical outcome in these patients. It would therefore be useful to identify biological prognostic markers that add value above and beyond this standard score. The goal of this study is to discover such a biomarker.

Study methods:

Pre-selection of candidate biomarkers

Twelve biomarkers associated with cancer X were identified in a previous study. We attempted to validate the association between these 12 candidates and cancer X in an independent sample of patients/tumors, described below.

Univariate validation of pre-selected candidate biomarkers

Levels of these biomarkers were measured in a set of 220 patients/tumors.

[Note: I have masked the data and made them available for public download as a *.csv file. The file has the following columns: “ID”, a unique identifier for each patient; “PS”, the prognostic score for each patient, with 1 indicating a good prognosis and 2 indicating a bad prognosis; “m1” to “m12”, levels of each tumor marker; “time”, in months; and “event”, where 0 indicates that the observation is censored and 1 indicates that treatment failure occurred.]

Univariable Cox regression models with time to death as the dependent variable were built for each of the 12 biomarkers (n = 220 observations, number of events = 91).

        Risk  LCI  UCI pValue
    1   0.93 0.86 1.02 0.1088
    2   0.93 0.88 0.99 0.0215
    3   0.99 0.92 1.05 0.6528
    4   0.93 0.87 1.00 0.0468
    5   0.93 0.88 0.98 0.0055
    6   0.97 0.92 1.01 0.1202
    7   0.91 0.83 0.99 0.0297
    8   0.98 0.90 1.07 0.6972
    9   0.99 0.92 1.06 0.7841
    10  1.01 0.91 1.11 0.9149
    11  0.96 0.87 1.05 0.3837
    12  0.90 0.83 0.97 0.0047

Using a Bonferroni-corrected threshold p value of 0.05/12 ≈ 0.0042, none of the results were significant.
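The threshold check can be reproduced with a few lines of Python. This is only a sketch: the p-values are copied from the table above, and the labels "m1" to "m12" assume that the table rows are in the same order as the marker columns in the data file.

```python
# Univariable Cox p-values copied from the table above.
# Assumption: row i of the table corresponds to marker column "m<i>".
p_values = {
    "m1": 0.1088, "m2": 0.0215, "m3": 0.6528, "m4": 0.0468,
    "m5": 0.0055, "m6": 0.1202, "m7": 0.0297, "m8": 0.6972,
    "m9": 0.7841, "m10": 0.9149, "m11": 0.3837, "m12": 0.0047,
}

alpha = 0.05
threshold = alpha / len(p_values)  # Bonferroni: divide alpha by the number of tests

# Markers whose p-value falls below the corrected threshold.
significant = [marker for marker, p in p_values.items() if p < threshold]
# Marker with the smallest p-value, for comparison against the threshold.
smallest = min(p_values, key=p_values.get)

print(f"threshold = {threshold:.5f}")
print(f"smallest p-value: {smallest} = {p_values[smallest]}")
print(f"significant markers: {significant}")
```

The smallest p-value (0.0047) sits just above the corrected threshold, which is why the unrounded value of 0.05/12 matters here.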

Multivariable analyses

It was decided to fit a model to the data by inputting all 12 biomarkers at once into a stepwise Cox regression algorithm using ten-fold cross-validation. After building ten models on the ten different training sets, time-dependent ROC curves were built to allow selection of optimal cutoff points to identify two groups of patients, “high” and “low” risk. Cut points that minimized “1 – TP + FP” were selected. Each of the ten models was then used to make predictions for the patients in its corresponding validation group. These patients were classified into “high” and “low” risk groups and plotted on a single, cross-validated Kaplan-Meier curve.


The confidence intervals of the high and low risk curves overlapped substantially, suggesting that the identified biomarkers were not useful prognostic markers. Our study therefore has not identified any significant univariate or multivariate associations between these markers and patient prognosis.

Questions for the community

Have I gone about analyzing my data in the correct manner?

If you had been the statistician on this study, would you have done something differently?

Prior to performing the validation analyses, sample size and power calculations were not performed to determine the number of samples to include and the detectable effect size. I would like to perform these analyses now to guide future studies. Can someone tell me how to do this?
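For reference, one common starting point for this kind of calculation is a Schoenfeld-type approximation, in which power depends on the number of events rather than on the number of patients. The sketch below is not taken from either reference above. It assumes a single covariate with standard deviation `sd`, a two-sided test, and no correlation with other covariates; the function names are made up for illustration.

```python
from math import ceil, exp, log, sqrt
from statistics import NormalDist


def required_events(hazard_ratio, sd=1.0, alpha=0.05, power=0.80):
    """Approximate number of events needed to detect `hazard_ratio` per unit
    of a covariate with standard deviation `sd` (two-sided test)."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    return ceil(z_sum ** 2 / (sd ** 2 * log(hazard_ratio) ** 2))


def detectable_hazard_ratio(events, sd=1.0, alpha=0.05, power=0.80):
    """Approximate smallest hazard ratio (> 1) per unit of the covariate
    that `events` events can detect with the given power."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    return exp(z_sum / (sd * sqrt(events)))


# 91 events, Bonferroni-corrected alpha for 12 markers, marker scaled to SD = 1.
hr = detectable_hazard_ratio(91, sd=1.0, alpha=0.05 / 12, power=0.80)
print(f"detectable hazard ratio per SD: {hr:.2f}")

# A binary covariate split 50/50 has sd = 0.5.
print("events needed for HR = 2, 50/50 split:", required_events(2.0, sd=0.5))
```

If the marker is correlated with other covariates in the model, the required number of events is larger than this approximation suggests.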

What I am really interested in is whether these biomarkers provide predictive information above and beyond the clinical prognostic score. From what I understand, this would entail making three different models: (1) a model with clinical covariates only, (2) a biomarker model with biomarker covariates only, and (3) a biomarker/clinical model based on both types of covariates. So far I have made models 1 (not shown above; it was unable to differentiate between high and low risk patients in our sample either) and 2 (shown above). Because 1 and 2 were not significant, I didn’t make model 3. Should I do this anyway?

Any additional comments about analytical concerns would be greatly appreciated! Please feel free to download the masked data and have a look yourself.

Best Answer

You have nicely described the problem and have set it up well in a number of ways. I wasn't clear on the definition of "prognostic score", but it is very unlikely that a 2-level score is clinically helpful. It is important to adjust for all pertinent available clinical variables, based on expert opinion when choosing them. Here are some opportunities for improvement:

  1. 10-fold cross-validation is unstable and needs to be repeated 100 times to obtain adequate precision (or use the Efron-Gong optimism bootstrap with 400 resamples; both of these are available in the R rms package)
  2. Dividing the signal into "good" and "bad" driven by ROC curves is a popular technique but is not based on any good statistical principles. Any biomarker worth its salt should have a dose-response relationship, and division into two very arbitrary groups is unnecessary, misleading, and information- and power-losing.
  3. ROC curves have absolutely nothing to offer in this context
  4. Choosing cutpoints on the biomarkers is a statistical disaster. Among other things it fails to recognize that mathematically if any cutpoints are useful they can only be on the back end, not on the covariate end, because the cutpoint for each marker depends on the absolute value of all the other marker values for a patient.
  5. Stepwise regression without penalization is not reliable. In your setup there is no reason not to put all the markers into one model and to do a likelihood ratio $\chi^2$ test to test the value they add to clinical variables.
  6. A good alternative to 5. is to do a redundancy analysis or variable clustering of the biomarkers to reduce their number before relating them to the outcome.
  7. If your sample size were larger you could allow all the variables to enter into the model nonlinearly using regression splines. Occasionally allowing one biomarker to be smooth and nonlinear doubles its value over forcing linearity.
  8. Let the log likelihood, which is an optimal scoring rule (penalized likelihood would be even better) do its job. Don't spend time on improper accuracy scoring rules.
  9. Consider using the "adequacy index", based on log likelihood, for describing the utility of the biomarkers, as described in my book Regression Modeling Strategies.
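Points 5 and 9 can be sketched in a few lines once the three log likelihoods are available from whatever software fits the Cox models. This is a rough illustration, not the rms implementation. It takes the adequacy index to be the likelihood ratio $\chi^2$ of the clinical-only model divided by that of the full model, which is how I understand its usual definition. The function names and the log-likelihood values are hypothetical and are not results from the posted data.

```python
from math import exp, lgamma, log


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with `df` degrees of
    freedom, using the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 1.0
    a, y = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= y / (a + n)
        total += term
    lower = total * exp(-y + a * log(y) - lgamma(a))
    return max(0.0, 1.0 - lower)


def added_value(loglik_null, loglik_clinical, loglik_full, n_markers):
    """Likelihood ratio test for the markers added to the clinical model,
    plus the adequacy of the clinical-only model."""
    lr_clinical = 2 * (loglik_clinical - loglik_null)  # clinical vs. null
    lr_full = 2 * (loglik_full - loglik_null)          # clinical + markers vs. null
    lr_added = lr_full - lr_clinical                   # markers given clinical
    return {
        "lr_added": lr_added,
        "p_value": chi2_sf(lr_added, n_markers),
        "adequacy": lr_clinical / lr_full,
    }


# Hypothetical log likelihoods, for illustration only.
result = added_value(-450.0, -444.0, -436.0, n_markers=12)
print(result)
```

An adequacy close to 1 would mean the clinical variables already carry nearly all of the information in the full model; one minus the adequacy is the fraction of that information attributable to the markers.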
