VariErr NLI: Separating Annotation Error from Human Label Variation

About

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.
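The two-round procedure described above can be pictured as a small data model: round 1 produces label-explanation pairs, and round 2 attaches validity judgments to each pair. The sketch below is only an illustration under stated assumptions. The class `Explanation`, the function `label_status`, the majority-vote `threshold`, and the example item are all hypothetical and are not taken from the paper; in particular, the aggregation rule is a stand-in and may differ from how VariErr defines an error. Four judgments per pair is chosen to match the reported counts (7,732 judgments on 1,933 explanations).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Explanation:
    """Round 1: one annotator's label for an item plus a free-text explanation."""
    annotator: str
    label: str  # e.g. "entailment", "neutral", "contradiction"
    text: str
    # Round 2: validity judgments on this label-explanation pair, keyed by judge.
    judgments: dict[str, bool] = field(default_factory=dict)


def label_status(explanations: list[Explanation], threshold: float = 0.5) -> dict[str, str]:
    """Mark each label on one item as plausible variation or suspected error.

    Hypothetical rule (an assumption, not the paper's definition): a label
    counts as "variation" if at least one of its explanations is judged valid
    by more than `threshold` of its judges; otherwise it is an "error".
    """
    status: dict[str, str] = {}
    for exp in explanations:
        votes = list(exp.judgments.values())
        valid = bool(votes) and sum(votes) / len(votes) > threshold
        if valid:
            status[exp.label] = "variation"
        else:
            # Keep an earlier "variation" verdict for the same label if present.
            status.setdefault(exp.label, "error")
    return status


# Invented example item with three round-1 annotations and four judges each.
item = [
    Explanation("a1", "entailment", "The premise states it directly.",
                {"a1": True, "a2": True, "a3": True, "a4": False}),
    Explanation("a2", "neutral", "The premise does not mention the time.",
                {"a1": True, "a2": True, "a3": False, "a4": True}),
    Explanation("a3", "contradiction", "Misread the negation.",
                {"a1": False, "a2": False, "a3": False, "a4": False}),
]
result = label_status(item)
```

Under this assumed rule, an item can keep several labels as valid variation while a label whose only explanation is rejected by the judges is flagged as a suspected error.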

Leon Weber-Genzel, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Inference | ANLI R1 1.0 (test) | Weighted F1 | 40.2 | 28 |
| Natural Language Inference | ANLI R2 1.0 (test) | Weighted F1 | 0.311 | 28 |
| Natural Language Inference | ANLI R3 1.0 (test) | Weighted F1 | 32.1 | 28 |
| Natural Language Inference | ANLI R1 (test) | Accuracy | 40.2 | 26 |
| Natural Language Inference | ANLI R3 (test) | Accuracy | 32.1 | 26 |
| Natural Language Inference | ANLI R2 (test) | Accuracy | 31.1 | 20 |
| Natural Language Inference Distribution Estimation | ChaosNLI | KL Divergence | 3.604 | 12 |
| Natural Language Inference Distribution Estimation | ChaosNLI 1.0 (dev) | KL Divergence (BERT FT) | 0.177 | 8 |
| Natural Language Inference Distribution Estimation | ChaosNLI 1.0 (test) | KL Div (Dist) | 3.604 | 8 |