AFinLA 2026 Symposium · Pilot Study
- Linguistics Student · Lead Annotator · n=300
- No English Specialization · n=49
- University Lecturer · Original Study Author · n=49
- English Major · University Student · n=49
- Google AI (Gemini) · Zero-Shot · n=281
- OpenAI (ChatGPT) · Zero-Shot · n=246
Peer Rater C (English Major) and Gemini reached the strongest agreement in the study (72.3%, n=47), suggesting that the LLM's pragmatic reasoning closely mirrors the intuitions of trained but non-specialist raters.
Gemini agreed with the expert annotator more closely than any human peer did (κ = 0.155–0.284), suggesting that the LLM tracks sociopragmatic patterns beyond surface-level sentiment, although even these values fall only in the Slight-to-Fair range.
The three peer raters agreed with one another at Fair-to-Moderate levels (κ = 0.399–0.453), forming an interpretive cluster distinct from the expert's sociolinguistic framework.
ChatGPT showed the lowest agreement with every other rater, effectively collapsing the 5-label taxonomy into a binary aggression/bonding split: 72% of its labels fell into just those two categories.
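The percent-agreement and κ figures above reduce to two standard quantities, sketched below in a minimal from-scratch Python reconstruction (not the study's actual pipeline). Cohen's κ is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance given each rater's label frequencies; all label names other than aggression and bonding are invented placeholders.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two raters chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Chance agreement p_e: probability that two independent raters with
    # these marginal label frequencies happen to pick the same label.
    p_e = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Toy data: "teasing" and "neutral" are hypothetical labels, since only
# aggression and bonding are named in the findings above.
rater_c = ["bonding", "aggression", "teasing", "bonding", "neutral"]
gemini  = ["bonding", "aggression", "bonding", "bonding", "neutral"]

print(f"raw agreement: {percent_agreement(rater_c, gemini):.1%}")  # 80.0%
print(f"Cohen's kappa: {cohens_kappa(rater_c, gemini):.3f}")       # 0.706
```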
| κ Range | Interpretation |
|---|---|
| < 0.00 | Less than chance |
| 0.00–0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost perfect |
Citation: Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. *Biometrics, 33*(1), 159–174.
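The table is straightforward to encode; the sketch below (function name ours) maps a κ value onto these bands and reproduces the Fair-to-Moderate spread of the peer-rater cluster noted above.

```python
def landis_koch_band(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch (1977) interpretation."""
    if kappa < 0.00:
        return "Less than chance"
    bands = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
             (0.80, "Substantial"), (1.00, "Almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"  # guard for values rounding slightly above 1.0

print(landis_koch_band(0.399))  # Fair
print(landis_koch_band(0.453))  # Moderate
```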