| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Performance | tau-bench | Retail Accuracy 78.3 | 55 |
| Multi-turn tool-use interaction | tau-bench | Retail Success Rate 86.1 | 35 |
| Agentic Tool-use | τ2-Bench (Tau-bench) Retail and Telecom | Overall Success Rate 85.79 | 17 |
| Tool-use | Tau-Bench | TAU-AIR Score 67.5 | 14 |
| Agentic Tool-Use | Tau-Bench | Retail Score 71.3 | 13 |
| Tool-use performance | Tau-bench Retail (test) | Pass Rate 66 | 12 |
| Multi-turn agent decision making | tau-Bench (test) | Success Rate 55.8 | 7 |
| Tool Use | tau-Bench | Pass@1 85.4 | 6 |
| Function-calling | Tau-bench retail | Success Rate 46 | 5 |
| Function-calling | Tau-bench airline | Success Rate 50 | 5 |
| Agentic Performance | TAU2-Bench | Success Rate 85.4 | 5 |
| Tool | tau2-Bench | Accuracy 15 | 4 |