| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Error Detection | FRAMES (test) | Precision 97 | 36 |
| Error Detection | FRAMES | F1 Score 95 | 36 |
| Multi-hop Question Answering | Frames | ACCE 41.38 | 24 |
| Long-context Question Answering | FRAMES | Avg@4 Score 73.54 | 22 |
| Deep Research | FRAMES | Accuracy 56 | 14 |
| Question Answering | FRAMES | Accuracy 82.5 | 14 |
| Question Answering | FRAMES out-domain (test) | LasJ 31.31 | 11 |
| Multi-hop Factual Reasoning | FRAMES | Accuracy 82.3 | 10 |
| Fact Retrieval and Analysis | FRAMES | Accuracy 90.6 | 9 |
| Multi-hop Question Answering | FRAMES | Accuracy 50 | 8 |
| Agentic Reasoning | FRAMES n=50 (full) | Accuracy 77.31 | 8 |
| Search | Frames | Score 70.5 | 7 |
| Evidence Retrieval | FRAMES | Evidence Coverage Rate 55.8 | 6 |
| Out-of-Distribution Evaluation | Frames (OOD) | Avg@4 57.1 | 3 |
| Multi-hop Question Answering | FRAMES Small-scale (evaluation) | Search Count 3.2 | 1 |