| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Deep Research | DeepResearch Bench | RACE Overall | 54.02 | 31 |
| Judge Agreement Accuracy | DeepResearch 1319 queries (test) | Agreement Accuracy | 74.5 | 19 |
| Long-form deep research | DeepResearch Bench (test) | Overall Score | 48.24 | 13 |
| Question Answering | DeepResearch | HotpotQA Score | 44.7 | 12 |
| Deep Research | DeepResearch benchmark | Average Score | 53.4 | 8 |
| Multimodal Report Generation | DeepResearch Bench | DLB | 3.72 | 7 |
| Multi-hop Question Answering | DeepResearch-9K 29 (test) | P-hat | 0.45 | 6 |