Adversarial NLI: A New Benchmark for Natural Language Understanding
About
We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.
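The iterative human-and-model-in-the-loop collection loop described above can be sketched in miniature: annotators write candidate examples, only those that fool the current model are kept, and the model is then retrained on them so the next round targets a stronger adversary. Everything below (`ToyModel`, `collect_round`, the candidate sentences) is an illustrative stand-in, not the paper's actual models or data.

```python
# Toy sketch of adversarial human-and-model-in-the-loop data collection.
# All names and data here are illustrative, not from the ANLI paper.

class ToyModel:
    """Stand-in classifier: memorizes trained examples, else guesses 'neutral'."""
    def __init__(self):
        self.memory = {}

    def predict(self, premise, hypothesis):
        return self.memory.get((premise, hypothesis), "neutral")

    def train(self, examples):
        for premise, hypothesis, label in examples:
            self.memory[(premise, hypothesis)] = label

def collect_round(model, candidates):
    """Keep only annotator-written examples the current model gets wrong
    (the 'adversarial' filter applied each round)."""
    return [(p, h, y) for p, h, y in candidates if model.predict(p, h) != y]

# Simulated annotator candidates: (premise, hypothesis, gold label).
candidates = [
    ("A dog runs in the park.", "An animal is outside.", "entailment"),
    ("A dog runs in the park.", "The dog is asleep.", "contradiction"),
]

model = ToyModel()
round1 = collect_round(model, candidates)  # examples that fool the model are kept
model.train(round1)                        # retrain on them; next round faces a stronger model
round2 = collect_round(model, candidates)  # the retrained model now answers these correctly
print(len(round1), len(round2))            # prints: 2 0
```

In the real procedure each round's fooling examples are human-verified before joining the dataset, and the rounds use progressively stronger NLI models; repeating this indefinitely is what makes the benchmark a "moving target" rather than a static test set.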
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, Douwe Kiela • 2019
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Inference | SNLI (test) | Accuracy | 91 | 681 |
| Natural Language Understanding | GLUE (test) | SST-2 Accuracy | 94.9 | 416 |
| Natural Language Inference | SciTail (test) | Accuracy | 94.4 | 86 |
| Natural Language Inference | SNLI (dev) | Accuracy | 91.7 | 71 |
| Factual Consistency Evaluation | TRUE benchmark | PAWS (AUC-ROC) | 86.35 | 37 |
| Natural Language Inference | ANLI (test) | Overall Score | 55.1 | 28 |
| Natural Language Inference | MNLI (val) | Accuracy | 90.01 | 26 |
| Natural Language Inference | ANLI (val) | Accuracy | 73.37 | 21 |
| Natural Language Inference | WANLI (test) | Accuracy | 67.04 | 21 |
| Natural Language Inference | GNLI Human (test) | Accuracy | 82.86 | 21 |
*Showing 10 of 17 rows.*