Frank Harrell describes the concordance index (Somers' D) as not being sensitive enough to compare multiple survival models' discriminative ability, and I've observed this in my own work, with multiple candidate models producing very similar concordance scores. In this answer he provides an approach for comparing models with varying numbers of predictors.

However, what is the most effective evaluation criterion when you want to compare quite different models? E.g. a standard Cox model vs. a gradient-boosted one, or a model using the raw predictors vs. one using transformed predictors.

The survAUC R package provides many ways to compare models; in my cursory look, the AUC-based approaches showed the largest differences between models, but there are several implementations to choose from.

In classification one can use accuracy or AUC to aid model selection, and MSE in regression problems; what would be the most useful measure for survival analysis?


#### Best Answer

I think it's useful to consider transformations of predictors separately from choices of models. For transformations of predictors, the task is to find transformations that give well behaved residuals. That needs to be done regardless of how you then choose to combine those predictors into a model.

In terms of choosing among models you want the "most predictive ability." That needs to be defined a bit more precisely, in terms of how the model will be used for prediction in practice. There may be different costs to different types of errors. Is the practical cost of under-estimating survival really the same as that of over-estimating it? That assumption hides in the use of AUC or concordance as a measure of model quality, yet it is seldom true in practice. That may be even more of a reason to avoid such measures than their lack of sensitivity for discriminating among models.
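To see where that assumption hides, here is a toy Python sketch of Harrell's concordance (function name and the simplifications, e.g. ignoring tied event times, are my own): every usable pair contributes equally, so a pair where the model under-ranked risk costs exactly as much as one where it over-ranked it.

```python
def harrell_c(times, events, risk):
    """Toy Harrell's C: fraction of usable pairs in which the subject
    who failed earlier was given the higher risk score.
    times  -- observed follow-up times
    events -- 1 if the event was observed, 0 if censored
    risk   -- model-assigned risk scores (higher = worse prognosis)
    Simplified: tied event times are skipped; tied risks count 1/2."""
    concordant = tied = usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Pair is usable only if subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1       # correct ordering
                elif risk[i] == risk[j]:
                    tied += 1             # tied risks get half credit
    return (concordant + 0.5 * tied) / usable
```

Note that the pair contributions are symmetric: nothing in the metric lets you charge more for under-estimating survival than for over-estimating it.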

There also may be costs in acquiring information about some predictor variables that would need to be taken into account in practice. Develop a metric that takes all the costs (monetary and prediction-error) and benefits into account. If you don't consider those explicitly, your analysis will be based on hidden assumptions that might be contrary to your ultimate goals.
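One way to make such a metric explicit is sketched below in Python; the asymmetric weights and the flat `acquisition_cost` term are purely illustrative assumptions, stand-ins for whatever monetary and error costs apply in your setting.

```python
def net_cost(pred_times, true_times, under_w=2.0, over_w=1.0,
             acquisition_cost=0.0):
    """Hypothetical net-cost metric for survival-time predictions:
    under-estimating survival is charged under_w per time unit,
    over-estimating is charged over_w per time unit, and a fixed
    acquisition_cost reflects what it costs to measure the model's
    predictors. All weights are illustrative placeholders."""
    err = 0.0
    for p, t in zip(pred_times, true_times):
        if p < t:
            err += under_w * (t - p)   # under-estimated survival
        else:
            err += over_w * (p - t)    # over-estimated survival
    return err / len(true_times) + acquisition_cost
```

A model needing an expensive biomarker would then have to beat a cheaper model by enough prediction accuracy to pay for its own `acquisition_cost`.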

To get the "most predictive ability" as thus re-defined, you need a model that will generalize well to the underlying population rather than simply fit the sample you have. So once you have developed your net-cost metric, choose the modeling *scheme* that minimizes that cost as evaluated by cross-validation or bootstrapping. This evaluation should include all steps in the model building. Then apply the best scheme to your data.
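The scheme-level evaluation could be sketched as follows; this is a minimal, hypothetical Python illustration, where `scheme`, `cost`, and the toy mean-only model are stand-ins for your actual pipeline of transformations, fitting, and net-cost metric.

```python
import random

def cross_validated_cost(X, y, scheme, cost, k=5, seed=0):
    """k-fold cross-validation of an entire modeling *scheme*:
    every step (transformation choice, fitting, etc.) is re-run
    inside each fold, never on the held-out data."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # All model-building steps happen here, on training folds only.
        model = scheme([X[i] for i in train], [y[i] for i in train])
        preds = [model(X[i]) for i in fold]
        total += cost(preds, [y[i] for i in fold]) * len(fold)
    return total / len(y)

def mean_scheme(X_train, y_train):
    """Trivial placeholder scheme: predict the training mean."""
    m = sum(y_train) / len(y_train)
    return lambda x: m

def mae(preds, truths):
    """Placeholder cost: mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(preds)
```

You would run `cross_validated_cost` once per candidate scheme with your own net-cost function, pick the scheme with the lowest cross-validated cost, and only then refit that scheme on the full data.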

This approach isn't as simple as evaluating a pre-defined metric like AUC or AIC, but it should be better aligned with your ultimate goal of developing a generally useful predictive model. Finally, in case you've missed it, there is a new edition of Harrell's *Regression Modeling Strategies*, which is a rich source of information and practical advice.