| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| General AI Assistant Tasks | GAIA | Accuracy | 93.2 | 291 |
| General AI Assistant Task | GAIA (val) | Level 1 Score | 96.23 | 97 |
| General AI Assistant tasks | GAIA | Avg Performance | 80.61 | 72 |
| Deep search | gaia | Accuracy | 81.9 | 59 |
| Agentic Evaluation | GAIA | Accuracy | 28.12 | 50 |
| General AI Assistant tasks | GAIA | Pass@1 Score | 83.4 | 38 |
| General AI Assistant Tasks | GAIA | Task Success Rate | 71.5 | 30 |
| Reasoning | GAIA text | Average Accuracy | 69.9 | 28 |
| Web Task Reasoning | GAIA (test) | Pass@1 | 84.5 | 25 |
| Agentic Benchmarks | GAIA | Execution Time (min) | 1.6 | 25 |
| Deep Search | GAIA text-only (val) | Accuracy | 70.9 | 24 |
| Deep research | GAIA | Accuracy | 78.2 | 24 |
| General AI assistant tasks | GAIA n=165 (dev) | Average Accuracy | 73.93 | 23 |
| General AI Assistant Tasks | GAIA | Avg@8 Score | 88.5 | 22 |
| Question Answering | GAIA | Accuracy (Pass@4) | 51 | 22 |
| General AI Assistant Task | GAIA | Accuracy | 62 | 21 |
| Embodied Agentic | GAIA | Accuracy | 0.672 | 21 |
| Deep Research | GAIA text-only original (test) | Pass@1 | 74.1 | 20 |
| Complex Reasoning | GAIA Text | Accuracy | 76.4 | 19 |
| General AI Assistant | GAIA text-only | Score | 81.9 | 19 |
| General AI Assistant | GAIA text | GAIA Average Score | 70.5 | 19 |
| Long-Horizon Search Intelligence | GAIA | Pass@1 | 57.3 | 18 |
| Multi-turn tool use | GAIA | Pass@1 | 76.4 | 18 |
| General AI Assistant Reasoning | GAIA Full | Accuracy | 60.12 | 18 |
| General AI Assistant Reasoning | GAIA (File/Reasoning/Others) | Accuracy | 56.21 | 18 |