I have a sample of patients who underwent 2 diagnostic tests, one of which is the gold standard. I've been asked to test if the diagnosis from the experimental test is different than the one from the gold standard. Here is the contingency table:
> table(test=db$exptest, gold=db$goldstandard) gold test sick healthy sick 7 4 healthy 8 27
At first I thought this was a case for McNemar's test, but it seems this test is meant to compare 2 experimental test results against the gold standard, as seen here, or here. Nevertheless, the wiki page or this post are not that exclusive, but not very clear for my case though.
Should I perform McNemar's test in this specific case? If not, can I test this hypothesis and how?
Best Answer
If you use McNemar's test you are testing whether the table is symmetric: whether more people are diagnosed sick by the new method and well by the old versus well by the new and sick by the old. This is a perfectly reasonable scientific question to have.For a concrete situation suppose the two methods being compared are ratings of mental health problems by a psychiatrist and by a family physician. Since they see different case mix in their practice you might ask whether this affects their threshold for declaring someone ill.
If you use Cohen's kappa you are evaluating whether agreement between the methods is more than would be expected by chance. This again a perfectly reasonable question to have but it is different. So if you are comparing two methods for diagnosing mild cognitive impairment where there is no gold standard you might treat agreement between methods as justifying the concept of MCI and if they disagree you might wonder whether it is a useful diagnosis at all.
Calculating sensitivity and specificity is the usual method for diagnostic tests and evaluates the performance separately in the two groups: well according to the gold standard and sick according to the gold standard. Again this is a reasonable thing to do but it is different from the other two. In this case you have two separate things which you are interested in and your focus in a practical situation might be on one or the other. For instance if you are screening for a fatal disease you might want a test with high sensitivity since you do not want to miss cases. On the other hand if you are recruiting into a trial you might not mind missing a few but on cost grounds you might want high specificity since you do not want o do the full diagnostic work-up on more people than is absolutely essential.