| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agentic Reasoning | τ-Bench | Score | 62.58 | 100 |
| Tool-use | τ-Bench | Average Pass@1 | 67.42 | 38 |
| Long-context Reasoning | ∞ Bench | Accuracy | 90.39 | 32 |
| Agent Task Completion | τ-BENCH (test) | Average Task Reward | 0.791 | 27 |
| Tool-use Agent Performance | τ²-bench | Retail Success Rate | 82.5 | 19 |
| User Simulator Goal Alignment | τ-Bench Retail (test) | User Profile Success Rate | 94.5 | 14 |
| User simulator goal alignment | τ-Bench Retail | User Profile Adherence | 94.5 | 14 |
| User simulator goal alignment | τ-Bench Airline | User Profile Alignment (Prof.) | 98.7 | 14 |
| Tool Use Reasoning | τ-Bench | Avg Accuracy | 63.9 | 14 |
| Long-context language tasks (MC, QA, Sum) | ∞Bench | MC Accuracy | 78.6 | 13 |
| Long-context Question Answering | ∞Bench | Accuracy | 78.46 | 13 |
| Question Answering | ∞-Bench Longbook QA English (test) | Tokens | 4,096 | 9 |
| Tool Use | τ²-Bench (out-of-distribution) | Retail Score | 54.9 | 8 |
| Task-oriented Dialogue | τ-bench 157 scenarios | Collaboration SR | 45.5 | 7 |
| Agentic Dialogue | τ-Bench (test) | Retail Accuracy | 60.4 | 7 |
| Web Agent Task Success | τ-bench | Retail Success Rate | 76.52 | 6 |
| Agentic Tool Use | τ²-Bench Telecom | Accuracy | 99.3 | 5 |
| Agentic Tool Use | τ²-Bench Airline | Accuracy | 67.5 | 5 |
| Agentic Tool Use | τ²-Bench Retail | Accuracy | 84.7 | 5 |
| Agent & OpenClaw | τ²-Bench | Accuracy | 76.6 | 5 |
| Tool-use Agent Robustness | τ-bench | Behavioral Uncertainty (BU) | 6.9 | 5 |
| Tool-use | τ-bench | τ-bench Score | 38.3 | 4 |
| Tool-calling | τ-Bench (test) | TSR | 43.77 | 4 |
| Tool-use Agent Evaluation | τ-bench retail domain (All 115 tasks) | Pass@1 | 43.9 | 4 |
| Tool-use Agent Evaluation | τ-bench retail domain (Last 105 tasks) | Pass@1 | 42.5 | 4 |