Adversarial NLI: A New Benchmark for Natural Language Understanding
About
We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.
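The iterative human-and-model-in-the-loop collection loop described above can be sketched in miniature: annotators write candidate examples, only those that fool the current model are kept, and the model is then retrained on them so the next round targets a stronger adversary. Everything below (`ToyModel`, `collect_round`, the candidate sentences) is an illustrative stand-in, not the paper's actual models or data.

```python
# Toy sketch of adversarial human-and-model-in-the-loop data collection.
# All names and data here are illustrative, not from the ANLI paper.

class ToyModel:
    """Stand-in classifier: memorizes trained examples, else guesses 'neutral'."""
    def __init__(self):
        self.memory = {}

    def predict(self, premise, hypothesis):
        return self.memory.get((premise, hypothesis), "neutral")

    def train(self, examples):
        for premise, hypothesis, label in examples:
            self.memory[(premise, hypothesis)] = label

def collect_round(model, candidates):
    """Keep only annotator-written examples the current model gets wrong
    (the 'adversarial' filter applied each round)."""
    return [(p, h, y) for p, h, y in candidates if model.predict(p, h) != y]

# Simulated annotator candidates: (premise, hypothesis, gold label).
candidates = [
    ("A dog runs in the park.", "An animal is outside.", "entailment"),
    ("A dog runs in the park.", "The dog is asleep.", "contradiction"),
]

model = ToyModel()
round1 = collect_round(model, candidates)  # examples that fool the model are kept
model.train(round1)                        # retrain on them; next round faces a stronger model
round2 = collect_round(model, candidates)  # the retrained model now answers these correctly
print(len(round1), len(round2))            # prints: 2 0
```

In the real procedure each round's fooling examples are human-verified before joining the dataset, and the rounds use progressively stronger NLI models; repeating this indefinitely is what makes the benchmark a "moving target" rather than a static test set.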
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, Douwe Kiela • 2019
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Inference | SNLI (test) | Accuracy | 91 | 681 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 94.9 | 416 |
| Natural Language Inference | SciTail (test) | Accuracy | 94.4 | 86 |
| Natural Language Inference | SNLI (dev) | Accuracy | 91.7 | 71 |
| Factual Consistency Evaluation | TRUE benchmark | PAWS (AUC-ROC) | 86.35 | 37 |
| Natural Language Inference | ANLI (test) | Overall Score | 55.1 | 28 |
| Natural Language Inference | MNLI (val) | Accuracy | 90.01 | 26 |
| Natural Language Inference | ANLI (val) | Accuracy | 73.37 | 21 |
| Natural Language Inference | WANLI (test) | Accuracy | 67.04 | 21 |
| Natural Language Inference | GNLI Human (test) | Accuracy | 82.86 | 21 |
*Showing 10 of 17 rows.*