DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
About
Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new English reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs than what was necessary for prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on our generalized accuracy metric, while expert human performance is 96.0%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 47.0% F1.
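The generalized accuracy metric referred to above scores a prediction by token overlap with the gold answer, with additional handling in the official evaluation script for numbers and multi-span answers. The snippet below is a minimal Python sketch of a bag-of-tokens F1 in that spirit; it is an illustration only (the `token_f1` name is ours), and it omits the official script's normalization of numbers, articles, and punctuation.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Simplified bag-of-tokens F1 between a predicted and a gold answer string.

    Sketch for illustration only; the official DROP evaluation script also
    normalizes numbers, articles, and punctuation and aligns multi-span answers.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; otherwise no credit.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a counting question whose gold answer is "3"
print(token_f1("3 touchdowns", "3"))  # partial credit rather than exact match
```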
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reading Comprehension | DROP (dev) | F1 Score | 49.24 | 63 |
| Reading Comprehension | DROP (test) | F1 Score | 96.42 | 61 |
| Reading Comprehension | DROP 1.0 (test) | EM | 92.38 | 11 |
| Reading Comprehension | DROP v1.0 (dev) | EM | 46.75 | 8 |
| Discrete Reasoning | DROP num (dev) | EM | 43.8 | 7 |
| General Reasoning | ARC, BBH, DROP, CommonsenseQA, SIQA | CBRC Score | 71 | 5 |