Are there good reasons to use Cohen's Kappa over Gwet's AC1?

Stats newbie here, please bear with me.

I've been researching inter-rater reliability in order to figure out what I will be using for my thesis study. The most commonly used test for this in my field/department is Cohen's Kappa. While researching it, I ran into various papers criticizing it, and I've also seen questions here about how to report seemingly strange Kappa values.

I ran into a paper written by Gwet in 2002, "Kappa Statistic is not Satisfactory for Assessing the Extent of Agreement Between Raters," which shed some light on the problem and introduced an alternative statistic (AC1).

However, Cohen's Kappa still seems to be far more prevalent, so I am wondering whether I am missing something. Hence my question:

Are there good reasons to use Cohen's Kappa over Gwet's AC1?

I would argue that Cohen's kappa and Gwet's AC1 (gamma) are both problematic approaches to the estimation of inter-rater reliability. Both make a number of assumptions about the behavior of raters that are rarely tenable in practice, and both produce paradoxical results when these assumptions are violated (Feng, 2013; Zhao, Liu, & Deng, 2012).

The popularity of Cohen's kappa stems largely from historical precedent/inertia and the "intuitive appeal" of its logic (i.e., estimating chance agreement from the product of the raters' marginal distributions under an assumption of independence). The field's resistance to switching to Gwet's AC1 is probably due more to sociological reasons than statistical ones, although I think statistical arguments against AC1 could be made as well (see again the cited articles).

Unfortunately, a truly satisfactory alternative has not yet been developed. At the moment, your best bet is probably to report multiple measures. As @Alexis stated in her comments, one attractive option is to report specific agreement for each category, as suggested by Cicchetti & Feinstein (1990).
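
For concreteness, here is a small Python sketch (not from the original post; the counts are invented purely for illustration) that computes observed agreement, Cohen's kappa, Gwet's AC1, and the category-specific agreement of Cicchetti & Feinstein (1990) for a single two-rater, two-category table with skewed marginals. It shows how the indices can diverge: overall agreement is high, kappa comes out slightly negative, AC1 comes out high, and the specific-agreement values reveal that the raters agree on the common category but not at all on the rare one.

```python
# Hypothetical example: two raters, two categories ("yes"/"no"), skewed marginals.
# Counts are made up for illustration only.

def agreement_indices(a, b, c, d):
    """a = both yes, b = rater1 yes / rater2 no,
    c = rater1 no / rater2 yes, d = both no."""
    n = a + b + c + d

    # Observed (percent) agreement
    po = (a + d) / n

    # Cohen's kappa: chance agreement from the product of each rater's marginals
    p1_yes, p2_yes = (a + b) / n, (a + c) / n
    pe_kappa = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    kappa = (po - pe_kappa) / (1 - pe_kappa)

    # Gwet's AC1: chance agreement from the average marginal per category,
    # pe = sum_k pi_k * (1 - pi_k) / (q - 1), with q = 2 categories here
    pi_yes = (p1_yes + p2_yes) / 2
    pe_ac1 = (pi_yes * (1 - pi_yes) + (1 - pi_yes) * pi_yes) / (2 - 1)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)

    # Specific agreement per category (positive and negative agreement)
    pos_agree = 2 * a / (2 * a + b + c) if (2 * a + b + c) else float("nan")
    neg_agree = 2 * d / (2 * d + b + c) if (2 * d + b + c) else float("nan")

    return po, kappa, ac1, pos_agree, neg_agree

# 90 joint "yes", 5 + 5 disagreements, 0 joint "no"
po, kappa, ac1, pa, na = agreement_indices(a=90, b=5, c=5, d=0)
print(f"observed agreement = {po:.3f}")     # 0.900
print(f"Cohen's kappa      = {kappa:.3f}")  # about -0.053
print(f"Gwet's AC1         = {ac1:.3f}")    # about 0.890
print(f"positive agreement = {pa:.3f}")     # about 0.947
print(f"negative agreement = {na:.3f}")     # 0.000
```

Reporting several of these side by side, rather than any single index, makes it much harder for a skewed marginal distribution to hide what is actually going on in the data.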

References

Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551–558.

Feng, G. C. (2013). Factors affecting intercoder reliability: A Monte Carlo experiment. Quality & Quantity, 47(5), 2959–2982.

Zhao, X., Liu, J. S., & Deng, K. (2012). Assumptions behind inter-coder reliability indices. In C. T. Salmon (Ed.), Communication Yearbook (pp. 418–480). Routledge.
