| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Performance | tau-bench | Retail Accuracy: 78.3 | 55 |
| Multi-turn tool-use interaction | tau-bench | Retail Success Rate: 86.1 | 35 |
| Stateful Agent-User Interaction | tau-Bench Airline | Pass@1: 39.1 | 22 |
| LLM Agent Evaluation | tau-bench Retail | Pass@1: 64.8 | 22 |
| Agentic Tool-use | τ2-Bench (Tau-bench) Retail and Telecom | Overall Success Rate: 85.79 | 17 |
| Tool | tau2-Bench | Accuracy: 30.7 | 14 |
| Tool-use | Tau-Bench | TAU-AIR Score: 67.5 | 14 |
| Agentic Tool-Use | Tau-Bench | Retail Score: 71.3 | 13 |
| Tool-use performance | Tau-bench Retail (test) | Pass Rate: 66 | 12 |
| Tool-use Task Completion | Tau-Bench Airline v2 (test) | Pass Rate: 70 | 11 |
| Tool Use | tau-bench retail domain | Accuracy: 57 | 10 |
| Tool Use | tau-bench airline domain | Tool Use Accuracy (Airline): 43.2 | 10 |
| Agent Execution | tau-Bench (test) | Execution Accuracy: 59 | 8 |
| LLM Agent Evaluation | tau-bench Airline | Accuracy: 42 | 7 |
| Multi-turn agent decision making | tau-Bench (test) | Success Rate: 55.8 | 7 |
| Tool Use | tau-Bench | Pass@1: 85.4 | 6 |
| Ranking Preservation | tau-bench Airline (test) | Mean Spearman Rho: 0.944 | 5 |
| Function-calling | Tau-bench retail | Success Rate: 46 | 5 |
| Function-calling | Tau-bench airline | Success Rate: 50 | 5 |
| Agentic Performance | TAU2-Bench | Success Rate: 85.4 | 5 |
| Malicious Action Detection | tau-Bench retail (test) | TPR: 7.8 | 4 |