| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | SimpleQA | Accuracy 95.3 | 114 |
| Question Answering | SimpleQA Verified | Accuracy 82.5 | 60 |
| Agentic Evaluation | SimpleQA | Accuracy 84 | 50 |
| Multi-hop Question Answering | SimpleQA | Pass@1 18.8 | 36 |
| Knowledge Graph Question Answering | SimpleQA Freebase-based (test) | Hits@1 84.8 | 31 |
| Ranking Stability Analysis | SimpleQA and 4 Hallucination Benchmarks | Kendall's W 0.9 | 28 |
| Confidence Calibration | SimpleQA | Brier Score 0.0386 | 27 |
| Hallucination self-detection | SimpleQA | AUROC 95.9 | 27 |
| Factual QA | SimpleQA | Accuracy 5.92 | 24 |
| Question Answering with Abstention | SimpleQA | BAS 0.2 | 24 |
| Uncertainty Estimation | SimpleQA | AUROC 61.3 | 24 |
| Closed-book Question Answering | SimpleQA (train) | Accuracy 18.8 | 21 |
| Question Answering | SimpleQA | Accuracy 32.9 | 20 |
| Helpfulness | SimpleQA | Accuracy 6.64 | 20 |
| Short-form Question Answering | SimpleQA | ECE 7.9 | 18 |
| Question Answering | SimpleQA-verified OOD | Accuracy 42.2 | 18 |
| Tool Use | SimpleQA | Accuracy 91.5 | 12 |
| Question Answering | SimpleQA | Score 56.3 | 11 |
| Question Answering | SimpleQA | pass@128 53.97 | 11 |
| Question Answering | SimpleQA out-domain (test) | LasJ 36.5 | 11 |
| Generative Question Answering | SimpleQA | HALL Score 92 | 10 |
| Watermark Detection | SimpleQA | Delta_q 0.81 | 10 |
| Factual Question Answering | SimpleQA (test) | Accuracy 79.07 | 10 |
| web-agent QA | SimpleQA | F1 (Avg) 67.8 | 8 |
| Question Answering | SimpleQA | F1 Score 56.1 | 7 |