| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | SimpleQA | Accuracy | 95.3 | 92 |
| Question Answering | SimpleQA Verified | Accuracy | 82.5 | 60 |
| Agentic Evaluation | SimpleQA | Accuracy | 84 | 50 |
| Confidence Calibration | SimpleQA | Brier Score | 0.0386 | 27 |
| Question Answering | SimpleQA-verified OOD | Accuracy | 42.2 | 18 |
| Tool Use | SimpleQA | Accuracy | 91.5 | 12 |
| Question Answering | SimpleQA out-domain (test) | LasJ | 36.5 | 11 |
| Watermark Detection | SimpleQA | Delta_q | 0.81 | 10 |
| Factual Question Answering | SimpleQA (test) | Accuracy | 79.07 | 10 |
| Hallucination self-detection | SimpleQA | Accuracy | 67 | 8 |
| Question Answering | SimpleQA | EM | 58.3 | 7 |
| Fact-seeking Question Answering | SimpleQA no web | Accuracy | 55 | 7 |
| Factuality | SimpleQA | Factuality Score | 35.3 | 7 |
| Confidence Calibration | SimpleQA (test) | ECE | 6.8 | 7 |
| Knowledge Retrieval | SimpleQA | Accuracy | 74.01 | 5 |
| Agent Trajectory Performance | SimpleQA (test) | Pass@1 Accuracy | 77.1 | 4 |
| Question Answering | SimpleQA (test) | SimpleQA Score | 4.05 | 3 |
| Short-form QA | SimpleQA | Accuracy | 4 | 2 |