| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Error Detection | FRAMES (test) | Precision: 97 | 36 | |
| Error Detection | FRAMES | F1 Score: 95 | 36 | |
| Multi-hop Question Answering | FRAMES | Accuracy: 86 | 34 | |
| Multi-hop Question Answering | Frames | ACCE: 41.38 | 24 | |
| Long-context Question Answering | FRAMES | Avg@4 Score: 73.54 | 22 | |
| Long-context reasoning | FRAMES | Score: 83.5 | 18 | |
| Agentic Search | Frames | String-F1: 36.6 | 14 | |
| Deep Research | FRAMES | Accuracy: 56 | 14 | |
| Question Answering | FRAMES | Accuracy: 82.5 | 14 | |
| Document-level retrieval | FRAMES (test) | Recall: 73.3 | 13 | |
| Document Question Answering | FRAMES | EM: 10.5 | 13 | |
| Multi-hop Reasoning and Fact-checking | FRAMES | Average@3: 90.6 | 13 | |
| Complex Reasoning | Frames | Accuracy: 90.6 | 13 | |
| Information Retrieval | FRAMES | Recall: 81.5 | 11 | |
| Question Answering | FRAMES out-domain (test) | LasJ: 31.31 | 11 | |
| Multi-hop Factual Reasoning | FRAMES | Accuracy: 82.3 | 10 | |
| Task-oriented Dialogue | Frames | Success Rate (SR): 50.57 | 9 | |
| Fact Retrieval and Analysis | FRAMES | Accuracy: 90.6 | 9 | |
| Agentic Reasoning | FRAMES n=50 (full) | Accuracy: 77.31 | 8 | |
| Multi-step Reasoning and Factuality | FRAMES | Pass@1: 90.6 | 7 | |
| Search | Frames | Score: 70.5 | 7 | |
| Deep search QA | Frames | Accuracy: 46.42 | 6 | |
| Evidence Retrieval | FRAMES | Evidence Coverage Rate: 55.8 | 6 | |
| Multi-hop QA Retrieval | FRAMES | NDCG: 0.834 | 5 | |
| Agentic tasks | Frames | Accuracy: 70.45 | 5 | |