| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Error Detection | FRAMES (test) | Precision | 97 | 36 |
| Error Detection | FRAMES | F1 Score | 95 | 36 |
| Multi-hop Question Answering | Frames | ACCE | 41.38 | 24 |
| Multi-hop Question Answering | FRAMES | Accuracy | 86 | 22 |
| Long-context Question Answering | FRAMES | Avg@4 Score | 73.54 | 22 |
| Agentic Search | Frames | String-F1 | 36.6 | 14 |
| Deep Research | FRAMES | Accuracy | 56 | 14 |
| Question Answering | FRAMES | Accuracy | 82.5 | 14 |
| Question Answering | FRAMES out-domain (test) | LasJ | 31.31 | 11 |
| Multi-hop Factual Reasoning | FRAMES | Accuracy | 82.3 | 10 |
| Fact Retrieval and Analysis | FRAMES | Accuracy | 90.6 | 9 |
| Agentic Reasoning | FRAMES n=50 (full) | Accuracy | 77.31 | 8 |
| Search | Frames | Score | 70.5 | 7 |
| Deep search QA | Frames | Accuracy | 46.42 | 6 |
| Evidence Retrieval | FRAMES | Evidence Coverage Rate | 55.8 | 6 |
| Multi-hop QA Retrieval | FRAMES | NDCG | 0.834 | 5 |
| Agentic tasks | Frames | Accuracy | 70.45 | 5 |
| Multi-hop Question Answering | Frames out-of-domain | F1 Score | 0.413 | 4 |
| Query Routing | FRAMES In-Distribution (test) | CPT (90%) | 77.9 | 4 |
| Query Routing | FRAMES OOD | CPT 85% | 68.74 | 4 |
| Query Routing | FRAMES | CPT (95%) | 88.84 | 4 |
| Query Routing | FRAMES | CPT (90%) | 78.61 | 4 |
| Model Routing | FRAMES (ID) | CPT (80%) | 60.92 | 4 |
| Model Routing | FRAMES (ID queries) | CPT (85%) Score | 69.41 | 4 |
| Query Routing | FRAMES | Hypervolume | 0.8865 | 4 |