| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | SQuAD v1.1 (dev) | F1 Score | 95.8 | 375 |
| Question Answering | SQuAD v1.1 (test) | F1 Score | 95.4 | 260 |
| Question Answering | SQuAD 2.0 | F1 | 89.4 | 190 |
| Question Answering | SQuAD v2.0 (dev) | F1 | 91.2 | 158 |
| Question Answering | SQuAD | F1 | 89.8 | 127 |
| Question Answering | SQuAD (test) | F1 | 91.2 | 111 |
| Question Answering | SQuAD v1.1 | F1 | 94.7 | 79 |
| Question Answering | SQuAD (dev) | F1 | 91 | 74 |
| Question Answering | SQuAD v1.1 (val) | F1 Score | 96.22 | 70 |
| Machine Reading Comprehension | SQuAD | EM | 89.9 | 58 |
| Machine Reading Comprehension | SQuAD 2.0 (dev) | EM | 88.8 | 57 |
| Machine Reading Comprehension | SQuAD 2.0 (test) | EM | 89.6 | 51 |
| Question Answering | SQuAD | Exact Match | 93.33 | 50 |
| Hallucination Detection | SQuAD (test) | AUROCr | 83.8 | 48 |
| Machine Reading Comprehension | SQuAD 1.1 (dev) | EM | 89.71 | 48 |
| Machine Reading Comprehension | SQuAD 1.1 (test) | EM | 89.898 | 46 |
| Question Answering | SQuAD (test) | GPT Judge Accuracy | 89 | 45 |
| Generation | SQuAD | F1 Score | 88.3 | 44 |
| Open-domain Question Answering | SQUAD Open (test) | Exact Match | 56.6 | 39 |
| Question Answering | SQuAD | F1 Score | 71.4 | 36 |
| Extractive Question Answering | SQuAD 2.0 | F1 Score | 92.9 | 34 |
| Question Answering | SQuAD 2.0 (test) | EM | 89.7 | 34 |
| Calibration | SQuAD | ECE | 5.87 | 31 |
| Open-domain Question Answering | SQuAD Open-domain 1.1 (test) | Exact Match (EM) | 61.8 | 30 |
| Question Generation | SQuAD 1.1 (test) | BLEU-4 | 25.8 | 29 |