| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Question Answering | SciQ | Accuracy: 97.2 | 283 | |
| Science Question Answering | SciQ | Normalized Accuracy: 97.7 | 137 | |
| Multiple Choice Question Answering | SciQ | Accuracy: 100 | 81 | |
| Science Question Answering | SciQ | Accuracy (SciQ): 85.1 | 52 | |
| Question Answering | SciQ (train) | Accuracy: 100 | 36 | |
| Hallucination Detection | SciQ | AUROC: 0.9328 | 33 | |
| Reading Comprehension | SciQ | Accuracy: 93.7 | 32 | |
| Uncertainty quantification | SciQ (test) | AUROC: 74.5 | 28 | |
| Uncertainty Estimation (Factual QA) | SciQ 1,000 samples (val) | AUROC: 62.6 | 27 | |
| Question Answering | SciQ (test) | Accuracy: 80.7 | 26 | |
| Question Answering | SciQ In-Domain (test) | Precision: 83.68 | 24 | |
| STEM Question Answering | SciQ | First-Token Accuracy: 98.3 | 24 | |
| Factual Question Answering | SciQ (ID) | Precision: 76.44 | 24 | |
| Multi-turn Calibration | SciQ | ECE@1: 4.42 | 21 | |
| Open-ended generation | SciQ | ECE: 5.21 | 21 | |
| Science Knowledge | SciQ | Accuracy: 88.4 | 21 | |
| Hallucination Detection | SciQ | Accuracy: 96 | 17 | |
| Multiple Choice Question Answering | SciQ MC | Mean Per-Step Regret: 0.137 | 15 | |
| Question Answering | SciQ Abstract | Mean per-step regret: 0.135 | 15 | |
| Distractor Generation | SciQ (test) | Precision@1: 24.3 | 15 | |
| Language Modeling | SciQ | Perplexity: 11.95 | 13 | |
| Question Answering | SciQ (D_eval) | Accuracy: 71.4 | 12 | |
| Question Answering | SCIQ Generalization | Accuracy: 90.4 | 8 | |
| Question Answering | SciQ | Normalized Accuracy: 87.9 | 8 | |
| Science Question Answering | SciQ standard (test) | Accuracy: 90.2 | 8 |