UnifiedQA: Crossing Format Boundaries With a Single QA System

About

Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems.

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi • 2020

Related benchmarks

Task	Dataset	Result
Question Answering	ARC Challenge	--	906
Commonsense Reasoning	PIQA	Accuracy 85.3	757
Question Answering	OpenBookQA	Accuracy 87.2	465
Physical Interaction Question Answering	PIQA	Accuracy 89.5	415
Question Answering	ARC Easy	Normalized Acc 92	391
Boolean Question Answering	BoolQ	Accuracy 87.8	350
Question Answering	OBQA	Accuracy 36.73	347
Multitask Language Understanding	MMLU (test)	Accuracy 48.9	312
Reading Comprehension	RACE high	Accuracy 90	295
Science Question Answering	ScienceQA (test)	Average Accuracy 70.12	273

Showing 10 of 114 rows

