| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agentic Reasoning | τ-Bench | Score | 62.58 | 100 |
| Long-context Reasoning | ∞ Bench | Accuracy | 90.39 | 32 |
| Agent Task Completion | τ-BENCH (test) | Average Task Reward | 0.791 | 27 |
| Tool Use Reasoning | τ-Bench | Avg Accuracy | 63.9 | 14 |
| Long-context language tasks (MC, QA, Sum) | ∞Bench | MC Accuracy | 78.6 | 13 |
| Long-context Question Answering | ∞Bench | Accuracy | 78.46 | 13 |
| Tool-use Agent Performance | τ²-bench | Pass@1 | 56.4 | 12 |
| Tool Use | τ²-Bench (out-of-distribution) | Retail Score | 54.9 | 8 |
| Agentic Dialogue | τ-Bench (test) | Retail Accuracy | 60.4 | 7 |
| Agent Task Completion | τ²-Bench | Avg Task Reward | 92.1 | 2 |
| Text-to-All Generation | Bench | CLIP-FID (FG) | 25.5 | 2 |
| Background Generation | Bench | CLIP-FID (Compositional) | 21 | 2 |
| Foreground Generation | Bench | CLIP-FID (Comp.) | 13.4 | 2 |
| Failure attribution | τ-bench | Agent Accuracy | 75.9 | 2 |
| Agentic Reasoning | τ-Bench (test) | Score | - | 0 |