Kappa is a nonparametric statistic that can be used to measure interobserver agreement on imaging studies. Cohen's kappa compares two observers or, in machine learning, a specific algorithm's output against reference labels. Fleiss' kappa assesses interobserver agreement among more than two observers.
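When comparing two observers, the concept behind the test is similar to that of the chi-squared test. Two 2 x 2 tables are set up: one with the values expected if agreement were due to chance alone, and one with the actual data. Kappa indicates how much the observers agree beyond what chance alone would produce: kappa = (observed agreement - expected agreement) / (1 - expected agreement), where both agreements are expressed as proportions of the total observations. A kappa of 1 indicates perfect agreement and a kappa of 0 indicates agreement no better than chance.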
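To find the expected values, multiply the corresponding marginals (row total x column total) and divide by the total number of observations. In the formulas below, O1 is the observed count in the +/+ cell, O4 is the observed count in the -/- cell, and O2 and O3 are the two discordant cells:
Expected value for the +/+ cell: [(O1 + O2) x (O1 + O3)] / total observations
Expected value for the -/- cell: [(O3 + O4) x (O2 + O4)] / total observations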
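The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a validated implementation. It assumes O1 is the +/+ cell, O2 and O3 are the discordant cells, and O4 is the -/- cell. The function name `cohen_kappa_2x2` and the example counts are hypothetical.

```python
def cohen_kappa_2x2(o1, o2, o3, o4):
    """Cohen's kappa for a 2 x 2 agreement table.

    Assumed cell layout: o1 = +/+, o2 and o3 = discordant cells, o4 = -/-.
    """
    total = o1 + o2 + o3 + o4
    if total == 0:
        raise ValueError("the table has no observations")

    # Expected counts under chance agreement: product of marginals / total
    expected_pos = (o1 + o2) * (o1 + o3) / total
    expected_neg = (o3 + o4) * (o2 + o4) / total

    # Convert observed and expected agreement to proportions
    observed_agreement = (o1 + o4) / total
    chance_agreement = (expected_pos + expected_neg) / total

    if chance_agreement == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (observed_agreement - chance_agreement) / (1 - chance_agreement)


# Hypothetical counts, for illustration only
kappa = cohen_kappa_2x2(20, 5, 10, 15)
print(round(kappa, 2))  # 0.4
```

With these made-up counts, the observers agree on 70% of cases while 50% agreement would be expected by chance, giving a kappa of 0.4.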
Rating systems for kappa are controversial, as the cut-offs are arbitrary and cannot be proven, but one system classifies kappa values as:
- >0.75: excellent
- 0.40-0.75: fair to good
- <0.40: poor
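As a small illustration, the thresholds in this rating system can be encoded as a lookup. The function name `rate_kappa` is hypothetical. The treatment of the boundary values is an assumption: exactly 0.40 and exactly 0.75 are both placed in the "fair to good" band, following the ranges as listed.

```python
def rate_kappa(kappa):
    """Map a kappa value to the qualitative rating listed above.

    Boundary handling is an assumption: 0.40 and 0.75 both fall
    in the "fair to good" band.
    """
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"


print(rate_kappa(0.82))  # excellent
print(rate_kappa(0.40))  # fair to good
print(rate_kappa(0.15))  # poor
```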
Kappa can be extended to three or more readers using more elaborate equations (e.g. Fleiss' kappa). In that setting, kappa assesses agreement across all of the radiologists involved rather than between a single pair, which is a more stringent standard.
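One such extension is Fleiss' kappa. The sketch below is a minimal, unvalidated illustration of the standard formula. It assumes every case is rated by the same number of readers and that the input is a table with one row per case and one column per category, holding the number of readers who chose that category. The function name `fleiss_kappa` and the example data are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a cases x categories table of rating counts.

    counts[i][j] = number of readers who assigned case i to category j.
    Assumes each case was rated by the same number of readers (at least 2).
    """
    n_cases = len(counts)
    if n_cases == 0:
        raise ValueError("no cases provided")
    n_readers = sum(counts[0])
    if n_readers < 2:
        raise ValueError("at least two readers are required")
    if any(sum(row) != n_readers for row in counts):
        raise ValueError("every case must have the same number of ratings")
    n_categories = len(counts[0])

    # Per-case agreement: proportion of reader pairs that agree on the case
    per_case = [
        (sum(c * c for c in row) - n_readers) / (n_readers * (n_readers - 1))
        for row in counts
    ]
    observed_agreement = sum(per_case) / n_cases

    # Chance agreement from the overall proportion of ratings per category
    category_share = [
        sum(row[j] for row in counts) / (n_cases * n_readers)
        for j in range(n_categories)
    ]
    chance_agreement = sum(p * p for p in category_share)

    if chance_agreement == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (observed_agreement - chance_agreement) / (1 - chance_agreement)


# Hypothetical data: 4 cases, 3 readers, 2 categories (positive, negative)
ratings = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(ratings))  # 1.0 (all readers agree on every case)
```

Note that this formula scores each case by the proportion of agreeing reader pairs, so a case on which two of three readers agree still contributes partial agreement.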
Kappa is used for categorical values (e.g. larger vs. smaller, has condition vs. does not have the condition). The Bland-Altman analysis is used for continuous variables.