| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Boolean Question Answering | BoolQ | Accuracy | 91.26 | 323 |
| Question Answering | BoolQ | Accuracy | 90.9 | 317 |
| Reading Comprehension | BOOLQ | Accuracy | 94.47 | 279 |
| Common Sense Reasoning | BoolQ | Accuracy | 92.4 | 212 |
| Text Classification | BoolQ | Accuracy | 90.7 | 84 |
| Reading Comprehension | BoolQ | Accuracy (BoolQ) | 86.23 | 55 |
| Question Answering | BoolQ (test) | Accuracy | 91.752 | 46 |
| Factual Knowledge | Bool Q | Accuracy | 87.7 | 44 |
| Boolean Question Answering | BoolQ (test) | Accuracy (Avg) | 86.7 | 38 |
| Boolean Question Answering | BoolQ | Zero-shot Accuracy | 0.8229 | 36 |
| Reading Comprehension | BoolQ (val) | Accuracy | 97.7 | 34 |
| Yes/No Reading Comprehension | BoolQ 1.0 (test) | Normalized Accuracy | 69 | 33 |
| Boolean Question Answering | BoolQ | Accuracy | 92.3 | 29 |
| Faithfulness Evaluation | BoolQ | AUC π-Soft-NS | 37 | 27 |
| Boolean Question Answering | BoolQ | Delta Accuracy | -0.01 | 24 |
| Boolean Question Answering | BoolQ | Accuracy | 88 | 20 |
| Citation and Evidence Recall | BoolQ M | Rk | 100 | 20 |
| Binary Classification | BoolQ HELM | Balanced Accuracy | 89.75 | 18 |
| Commonsense Reasoning | BoolQ | Accuracy | 87.29 | 18 |
| Boolean Question Answering | BoolQ | Calibrated Accuracy | 86.1 | 18 |
| Zero-shot Prediction | BoolQ | Accuracy | 77.68 | 17 |
| Question Answering | BoolQ | Accuracy | 91.7 | 16 |
| Explanation Evaluation | BoolQ (test) | Sufficiency | 20.78 | 16 |
| Reading Comprehension | BoolQ (test) | Accuracy | 99.87 | 16 |
| Boolean Question Answering | BoolQ | Acc (Normalized) | 85.3 | 15 |