| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reading Comprehension | DROP | DROP Accuracy | 92.28 | 129 |
| Reading Comprehension | DROP | F1 Score | 92.2 | 96 |
| Reading Comprehension | DROP (test) | F1 Score | 96.42 | 76 |
| Reading Comprehension | DROP (dev) | F1 Score | 88.1 | 63 |
| Question Answering | DROP | F1 Score | 87.5 | 45 |
| Generation | DROP | F1 Score | 32.9 | 43 |
| Natural Language Reasoning | DROP | Accuracy | 89.62 | 43 |
| Reasoning | DROP | Score | 89.27 | 42 |
| Reading Comprehension | DROP (test) | F1 Score | 76 | 29 |
| Reading Comprehension | DROP | F1 Score | 69.18 | 25 |
| Reading Comprehension | DROP | DROP Score | 48.68 | 25 |
| Discrete Reasoning | DROP | Exact Match (EM) | 71.59 | 25 |
| Reading Comprehension | DROP (test) | Accuracy | 90.8 | 23 |
| Question Answering | DROP MRQA out-of-domain evaluation | EM | 64.9 | 23 |
| Video Reconstruction | Drop | PSNR | 35.03 | 21 |
| Reading Comprehension | DROP | Loss | 0.4 | 20 |
| Instruction-following | DROP | DROP Score | 51.53 | 20 |
| Question Answering | DROP nfl | F1 Score | 67.69 | 17 |
| In-context retrieval | DROP | Accuracy | 88.6 | 16 |
| Multi-hop QA | DROP (test) | F1 Score | 87.9 | 14 |
| Reading Comprehension | DROP MRQA out-of-domain | EM | 0.4884 | 14 |
| Grayscale Video Reconstruction | Drop | PSNR | 45.46 | 13 |
| Question Answering | DROP (test) | ROUGE | 76.78 | 12 |
| Query-based Information Extraction | DROP | F1 Score | 64.64 | 12 |
| Grounded Text Generation | DROP history | F1 | 51.17 | 11 |