| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | SQuAD v1.1 (dev) | F1 Score | 95.8 | 380 |
| Question Answering | SQuAD v1.1 (test) | F1 Score | 95.4 | 260 |
| Question Answering | SQuAD 2.0 | F1 | 89.4 | 190 |
| Question Answering | SQuAD v2.0 (dev) | F1 | 91.2 | 163 |
| Question Answering | SQuAD | F1 | 89.8 | 134 |
| Prompt Injection Defense | Inj-SQuAD | Combined ASR | 0.11 | 123 |
| Question Answering | SQuAD (test) | F1 | 91.2 | 111 |
| Question Answering | SQuAD | Exact Match | 93.33 | 83 |
| Question Answering | SQuAD v1.1 | F1 | 94.7 | 79 |
| Question Answering | SQuAD (dev) | F1 | 91 | 74 |
| Question Answering | SQuAD | ACE (General) | 0.112 | 70 |
| Question Answering | SQuAD v1.1 (val) | F1 Score | 96.22 | 70 |
| Machine Reading Comprehension | SQuAD | EM | 89.9 | 58 |
| Machine Reading Comprehension | SQuAD 2.0 (dev) | EM | 88.8 | 57 |
| Machine Reading Comprehension | SQuAD 2.0 (test) | EM | 89.6 | 51 |
| Hallucination Detection | SQuAD (test) | AUROCr | 83.8 | 48 |
| Machine Reading Comprehension | SQuAD 1.1 (dev) | EM | 89.71 | 48 |
| Machine Reading Comprehension | SQuAD 1.1 (test) | EM | 89.898 | 46 |
| Question Answering | SQuAD (test) | GPT Judge Accuracy | 89 | 45 |
| Generation | SQuAD | F1 Score | 88.3 | 44 |
| Open-domain question answering | SQUAD Open (test) | Exact Match | 56.6 | 39 |
| Question Answering | SQuAD KRE-curated version | F1 Score | 72.6 | 36 |
| Question Answering | SQuAD v2 | ASR Score | 1 | 36 |
| Question Answering | SQuAD | F1 Score | 71.4 | 36 |
| Extractive Question Answering | SQuAD 2.0 | F1 Score | 92.9 | 34 |