| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | SciQ | Accuracy 97.2 | 283 |
| Science Question Answering | SciQ | Normalized Accuracy 97.7 | 137 |
| Science Question Answering | SciQ | Accuracy (SciQ) 94.3 | 101 |
| Multiple Choice Question Answering | SciQ | Accuracy 100 | 91 |
| Question Answering | SciQ | PRR 60 | 66 |
| Selective Generation | SciQ | ROC-AUC 86.1 | 66 |
| Question Answering | SciQ | AUC 87.79 | 51 |
| Generation correctness prediction | SciQ (test) | AURC 35.01 | 42 |
| Generation correctness prediction | SciQ | AUROC 77.99 | 42 |
| Hallucination Detection | SciQ | AUC 88.99 | 42 |
| Question Answering | SciQ (train) | Accuracy 100 | 36 |
| Hallucination Detection | SciQ | AUROC 0.9328 | 33 |
| Question Answering | Sciq | Acc Norm 86.4 | 32 |
| Reading Comprehension | SciQ | Accuracy 93.7 | 32 |
| Question Answering | SciQ (test) | Accuracy 85.4 | 28 |
| Uncertainty quantification | SciQ (test) | AUROC 74.5 | 28 |
| Uncertainty Estimation (Factual QA) | SciQ 1,000 samples (val) | AUROC 62.6 | 27 |
| Scientific reasoning | SciQ | Accuracy 97.08 | 25 |
| Question Answering | SciQ In-Domain (test) | Precision 83.68 | 24 |
| STEM Question Answering | SciQ | First-Token Accuracy 98.3 | 24 |
| Factual Question Answering | SciQ (ID) | Precision 76.44 | 24 |
| Science Knowledge | SciQ | Accuracy 90.9 | 22 |
| Multi-turn Calibration | SciQ | ECE@14.42 | 21 |
| Open-ended generation | SciQ | ECE 5.21 | 21 |
| Uncertainty Estimation | SciQ | AUROC 82 | 18 |