Scott's Pi
Scott's pi (named after William A. Scott) is a statistic for measuring inter-rater reliability for nominal data in communication studies. Textual entities are annotated with categories by different annotators, and various measures are used to assess the extent of agreement between the annotators, one of which is Scott's pi. Automatically annotating text is a popular problem in natural language processing, where the goal is for the program being developed to agree with human annotators. Assessing the extent to which humans agree with each other is therefore important for establishing a reasonable upper limit on computer performance.
Introduction
Scott's pi is similar to Cohen's kappa in that both improve on simple observed agreement by correcting for the agreement that would be expected by chance. The two statistics differ in how that expected agreement is calculated. Scott's pi uses a baseline in which the annotators are independent and share the same distribution of responses; Cohen's kappa uses a baseline in which the annotators are independent but each has their own distribution of responses. Thus, Scott's pi measures disagreement relative to the agreement expected by pure chance if the annotators were independent and identically distributed, whereas Cohen's kappa measures only disagreement above and beyond any systematic, average difference between the annotators: kappa factors out such systematic differences before comparing the annotators, so it assesses only randomly varying disagreement. Scott's pi is extended to more than two annotators by Fleiss' kappa.
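The equation for Scott's pi, as in Cohen's kappa, is:
$$\pi = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$
where Pr(a) is the observed agreement and Pr(e) is the agreement expected by chance.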
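However, for Scott's pi, Pr(e) is the sum over categories of squared "joint proportions", each of which is the arithmetic mean of the two annotators' marginal proportions for that category (whereas Cohen's kappa sums the squared geometric means of them, i.e. the products of the two marginal proportions).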
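As a minimal sketch of this difference, the following Python computes both expected-agreement baselines from two annotators' marginal proportions. The function names and the example proportions are hypothetical, chosen only for illustration.

```python
def expected_agreement_scott(p1, p2):
    """Sum over categories of squared arithmetic means of the marginal proportions."""
    return sum(((a + b) / 2) ** 2 for a, b in zip(p1, p2))


def expected_agreement_cohen(p1, p2):
    """Sum over categories of products of the marginal proportions
    (equivalently, squared geometric means)."""
    return sum(a * b for a, b in zip(p1, p2))


# Hypothetical marginal proportions over two categories (illustrative only).
ann1 = [0.5, 0.5]
ann2 = [0.9, 0.1]

pe_scott = expected_agreement_scott(ann1, ann2)
pe_cohen = expected_agreement_cohen(ann1, ann2)
print(f"Scott: {pe_scott:.2f}, Cohen: {pe_cohen:.2f}")
```

With these made-up marginals the sketch gives 0.58 for Scott's baseline and 0.50 for Cohen's. Because an arithmetic mean is never smaller than the corresponding geometric mean, Scott's expected agreement is never below Cohen's for the same data, and the two coincide when both annotators have identical marginal proportions.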
Worked example
Confusion matrix for two annotators, three categories {Yes, No, Maybe} and 45 items rated (90 ratings for 2 annotators):
| | Yes | No | Maybe | Marginal Sum |
| Yes | 1 | 2 | 3 | 6 |
| No | 4 | 5 | 6 | 15 |
| Maybe | 7 | 8 | 9 | 24 |
| Marginal Sum | 12 | 15 | 18 | 45 |
To calculate the expected agreement, sum marginals across annotators and divide by the total number of ratings to obtain joint proportions. Square and total these:
| | Ann1 | Ann2 | Joint Proportion | JP Squared |
| Yes | 12 | 6 | (12 + 6)/90 = 0.2 | 0.04 |
| No | 15 | 15 | (15 + 15)/90 = 0.333 | 0.111 |
| Maybe | 18 | 24 | (18 + 24)/90 = 0.467 | 0.218 |
| Total | | | | 0.369 |
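To calculate observed agreement, divide the number of items on which annotators agreed (the diagonal of the confusion matrix) by the total number of items. In this case,
$$\Pr(a) = \frac{1 + 5 + 9}{45} = 0.333$$
Given that Pr(e) = 0.369, Scott's pi is then
$$\pi = \frac{0.333 - 0.369}{1 - 0.369} \approx -0.057$$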
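The calculation in this worked example can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation; the function name `scott_pi` and its return convention are assumptions made here, and the input is assumed to be a square confusion matrix for exactly two annotators.

```python
def scott_pi(confusion):
    """Scott's pi for two annotators from a square confusion matrix.

    confusion[i][j] counts the items given category i by one annotator
    and category j by the other. Returns (Pr(a), Pr(e), pi).
    """
    k = len(confusion)
    n_items = sum(sum(row) for row in confusion)

    # Observed agreement: share of items on the diagonal.
    pr_a = sum(confusion[i][i] for i in range(k)) / n_items

    # Marginal totals per category for each annotator.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    # Joint proportion: pooled ratings in a category over all 2 * n_items ratings.
    joint = [(row_totals[c] + col_totals[c]) / (2 * n_items) for c in range(k)]

    # Expected agreement: sum of squared joint proportions.
    pr_e = sum(p * p for p in joint)

    return pr_a, pr_e, (pr_a - pr_e) / (1 - pr_e)


# Confusion matrix from the worked example (categories Yes, No, Maybe).
matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
pr_a, pr_e, pi = scott_pi(matrix)
print(f"Pr(a) = {pr_a:.3f}, Pr(e) = {pr_e:.3f}, pi = {pi:.3f}")
```

Carried at full precision, the sketch gives a pi of about -0.056; rounding Pr(a) and Pr(e) to three decimals before dividing gives about -0.057. Either way the value is slightly negative, meaning the annotators agreed a little less often than the chance baseline would predict.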
References
- Scott, W. (1955). "Reliability of content analysis: The case of nominal scale coding." Public Opinion Quarterly, 19(3), 321–325.
- Krippendorff, K. (2004). "Reliability in content analysis: Some common misconceptions and recommendations." Human Communication Research, 30, 411–433.