I am working on a diagnostic test (I'm not a statistician). I used a logistic (logit) regression model with the help of another student who has since graduated.
I separated my data into a training set and a test set. In the training set I have an AUC of 0.94 and a sensitivity/specificity of 0.87/0.92.
In the test set I have an AUC of 0.88 (which seems acceptable), but the sensitivity/specificity is 0.43/0.92 if I use the same threshold as in the training set.
Is it an error to change the threshold? Also, how can I describe my data in terms of its diagnostic ability, and what could cause this drop?
Would it be reasonable to say that "the test seems to have a reproducible specificity, but many cases may be missed," or does this mean the validation failed completely?
Best Answer
If you used k-fold cross-validation (CV), you would be in a better position to evaluate these performance metrics. Splitting the data into training and test sets a single time (and then computing sens, spec, and AUC) is inefficient and does not give an optimal estimate of performance (see Ron Kohavi's paper on CV). I like to repeat 10-fold CV a total of ten times, shuffling the order of the objects before each repeat so that the folds contain different objects each time; each reshuffle is called a *repartition*, and the whole procedure is called "ten 10-fold CV."
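As a minimal sketch of ten 10-fold CV in Python, assuming scikit-learn, a logistic regression classifier, and placeholder features `X` with binary labels `y` (swap in your own data and model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data; replace with your own features and 0/1 labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)

# Ten repeats of 10-fold CV; the objects are reshuffled before each repeat,
# so every repeat yields a different partition into folds (a "repartition")
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

print(f"AUC: {aucs.mean():.3f} (SD {aucs.std():.3f}) over {len(aucs)} folds")
```

The spread of the 100 fold-level AUCs also gives you a sense of how much the estimate varies with the particular split, which a single train/test split cannot show.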
It sounds like you used CV for feature selection. If you trained using all the objects and then tested with a sub-sample, the classifier learned information from objects that are in the test set, which biases the results; CV prevents that from happening. For example, with 10-fold CV: permute (randomly shuffle) the objects and divide them uniformly into 10 folds (partitions). Train on folds 2-10 and test on fold 1. Then train on folds 1 and 3-10 and test on fold 2. Continue until you train on folds 1-9 and test on fold 10, as in the sketch below.
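Here is that fold-by-fold procedure written out explicitly, again only a sketch under the same assumptions (scikit-learn, logistic regression, placeholder `X` and `y`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# shuffle=True permutes the objects before dividing them into 10 folds
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Train on the other nine folds, test on the held-out fold
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    print(f"fold {fold}: AUC = {roc_auc_score(y[test_idx], scores):.3f}")
```

Every object is tested exactly once, by a model that never saw it during training, so nothing leaks from the test fold into training.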
You can average sens, spec, and AUC over the 10 folds, or keep adding ones to a single confusion matrix throughout the testing of the ten folds and then determine sens, spec, and AUC from that one matrix. To understand the latter: for each tested object, a one is added to the confusion-matrix element in the row representing the true class and the column representing the predicted class. When done, classification accuracy is the sum of the diagonal elements divided by the sum of all matrix elements, because the diagonal elements represent objects whose true and predicted classes are the same. Sens and spec are just variations on this theme.
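To illustrate the pooled-matrix approach, a sketch (same assumptions as above) that collects the out-of-fold prediction for every object, builds one confusion matrix, and derives accuracy, sens, and spec from it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Each object is predicted exactly once, by a model that never trained on it
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Rows are the true class, columns the predicted class
cm = confusion_matrix(y, y_pred)
tn, fp, fn, tp = cm.ravel()

acc = np.trace(cm) / cm.sum()   # diagonal (correct) over all tested objects
sens = tp / (tp + fn)           # true-positive rate
spec = tn / (tn + fp)           # true-negative rate
print(f"accuracy={acc:.3f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```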