| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agentic Workflow Success | τ2-bench | Airline Success Rate | 76.5 | 43 |
| Agent Task Completion | τ2-BENCH (test) | Average Task Reward | 0.921 | 27 |
| Long-Horizon User-Centric Interaction | τ2-Bench | Telecom Success Rate | 46.9 | 23 |
| Agent | τ2-Bench | Accuracy | 85.4 | 9 |
| Agentic Task Completion | τ2-Bench | Airline Success Rate | 44 | 7 |
| Agentic | τ2-Bench | Score | 91.6 | 7 |
| Multi-turn tool calling | τ2-bench | Overall Score | 17.77 | 5 |
| Web-based Decision-making | τ2 Bench Retail, Telecom, Airline | Retail Score | 48.3 | 5 |
| Task Resolution | τ2-bench (test) | Success Rate (Airline) | 38.3 | 2 |