| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Task Completion | τ2-BENCH (test) | Average Task Reward: 0.921 | 27 |
| Agentic Workflow Success | τ2-bench | Airline Success Rate: 60 | 13 |
| Agentic | τ2-Bench | Score: 91.6 | 7 |
| Multi-turn tool calling | τ2-bench | Overall Score: 17.77 | 5 |
| Web-based Decision-making | τ2 Bench Retail, Telecom, Airline | Retail Score: 48.3 | 5 |
| Agent | τ2-Bench | Accuracy: 69.5 | 4 |