| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Deep Research | DeepResearch Bench | RACE Overall: 53.08 | 22 |
| Judge Agreement Accuracy | DeepResearch 1319 queries (test) | Agreement Accuracy: 74.5 | 19 |
| Long-form deep research | DeepResearch Bench (test) | Overall Score: 48.24 | 13 |