| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agentic Tool-use | tau^2 Bench official evaluation setting GPT-4.1 simulator | Retail Score: 0.775 | 9 |
| Agentic Capability | tau^2-Bench Telecom | Pass@1: 89 | 7 |
| Agentic performance | TAU-2 Bench | Airline Score: 47.5 | 7 |
| Tool Use | tau^2-Bench | Pass@1: 85.4 | 5 |