| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | SQuAD v1.1 (dev) | F1 Score | 95.8 | 380 |
| Question Answering | SQuAD v1.1 (test) | F1 Score | 95.4 | 260 |
| Question Answering | SQuAD 2.0 | F1 | 89.4 | 215 |
| Question Answering | SQuAD v2.0 (dev) | F1 | 91.2 | 163 |
| Question Answering | SQuAD | F1 | 89.8 | 162 |
| Question Answering | SQuAD (test) | F1 | 91.2 | 156 |
| Prompt Injection Defense | Inj-SQuAD | Combined ASR | 0.11 | 123 |
| Question Answering | SQuAD v1.1 | F1 | 94.7 | 85 |
| Question Answering | SQuAD | Exact Match | 93.33 | 83 |
| Hallucination Detection | SQuAD | AUROC | 0.89 | 82 |
| Question Answering | SQuAD (dev) | F1 | 91 | 74 |
| Question Answering | SQuAD | ACE (General) | 0.112 | 70 |
| Question Answering | SQuAD v1.1 (val) | F1 Score | 96.22 | 70 |
| Question Answering | SQuAD | F1 Score | 94.7 | 63 |
| Machine Reading Comprehension | SQuAD | EM | 89.9 | 58 |
| Machine Reading Comprehension | SQuAD 2.0 (dev) | EM | 88.8 | 57 |
| Generation | SQuAD | F1 Score | 88.3 | 52 |
| Machine Reading Comprehension | SQuAD 2.0 (test) | EM | 89.6 | 51 |
| Hallucination Detection | SQuAD (test) | AUROCr | 83.8 | 48 |
| Machine Reading Comprehension | SQuAD 1.1 (dev) | EM | 89.71 | 48 |
| Machine Reading Comprehension | SQuAD 1.1 (test) | EM | 89.898 | 46 |
| Question Answering | SQuAD (test) | GPT Judge Accuracy | 89 | 45 |
| Hallucination Detection | SQuAD | AUC | 85.5 | 40 |
| Reading Comprehension | SQuAD | Attack Accuracy | 75.91 | 40 |
| Membership Inference Attack | SQuAD | AUC | 0.883 | 39 |