| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agentic tool-use | Tau2-Bench | Retail Score | 90.4 | 59 |
| Agent Performance | tau-bench | Retail Accuracy | 78.3 | 55 |
| LLM Agent Evaluation | tau-bench Retail | Pass@1 | 64.8 | 38 |
| Multi-turn tool-use interaction | tau-bench | Retail Success Rate | 86.1 | 35 |
| LLM Agent Evaluation | tau-bench Airline | Pass@4 | 78 | 29 |
| Stateful Agent-User Interaction | tau-Bench Airline | Pass@1 | 39.1 | 22 |
| Agentic Performance | TAU2-Bench | Success Rate | 85.4 | 20 |
| Agentic Tool-use | τ2-Bench (Tau-bench) Retail and Telecom | Overall Success Rate | 85.79 | 17 |
| Tool | tau2-Bench | Accuracy | 30.7 | 14 |
| Tool-use | Tau-Bench | TAU-AIR Score | 67.5 | 14 |
| Agentic Tool-Use | Tau-Bench | Retail Score | 71.3 | 13 |
| Tool-use performance | Tau-bench Retail (test) | Pass Rate | 66 | 12 |
| Tool-use Task Completion | Tau-Bench Airline v2 (test) | Pass Rate | 70 | 11 |
| Agent Performance | Tau-bench Telecom | Avg@4 Score | 45.39 | 10 |
| Agent Performance | Tau-bench Retail | Avg@4 | 60.31 | 10 |
| Original Tasks | extended tau-Bench Retail domain | Pass@1 | 90 | 10 |
| Sister Tasks | extended tau-Bench Retail domain | Pass@1 Rate | 83 | 10 |
| Original Tasks | tau-Bench Airline domain extended | Pass Rate @ Threshold 1 | 70 | 10 |
| Sister Tasks | extended tau-Bench Airline domain | Pass@1 Rate | 74 | 10 |
| Tool Use | tau-bench retail domain | Accuracy | 57 | 10 |
| Tool Use | tau-bench airline domain | Tool Use Accuracy (Airline) | 43.2 | 10 |
| Agent Execution | tau-Bench (test) | Execution Accuracy | 59 | 8 |
| Agentic Function | Tau2-Bench | Score | 81.2 | 7 |
| Multi-turn agent decision making | tau-Bench (test) | Success Rate | 55.8 | 7 |
| Tool Use | tau-Bench | Pass@1 | 85.4 | 6 |