| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Agentic Task Performance | τ2-Bench Airline 1.0 (test) | CAP | 96.4 | 48 |
| Agentic Task Performance | τ2-Bench Retail 1.0 (test) | Completion Accuracy (CAP) | 91.4 | 48 |
| Agentic Workflow Success | τ2-bench | Airline Success Rate | 76.5 | 43 |
| Agent | τ2-Bench | Accuracy | 85.4 | 41 |
| Agent Task Completion | τ2-BENCH (test) | Average Task Reward | 0.921 | 27 |
| Long-Horizon User-Centric Interaction | τ2-Bench | Telecom Success Rate | 46.9 | 23 |
| Multi-turn tool calling | τ2-bench | Airline Score | 38 | 19 |
| Agentic Task Completion | τ2-Bench | Airline Success Rate | 84 | 11 |
| Agentic task | τ2-Bench Telecom | Avg@2 Score | 45 | 8 |
| Agentic task | τ2-Bench Airline | Avg@4 | 60 | 8 |
| Agentic task | τ2-Bench Retail | Avg@4 | 69.7 | 8 |
| Agentic | τ2-Bench | Score | 91.6 | 7 |
| Web-based Decision-making | τ2 Bench Retail, Telecom, Airline | Retail Score | 48.3 | 5 |
| Agentic Task Completion | τ2-bench Retail | Success Rate | 100 | 4 |
| Agentic Task Completion | τ2-bench Airline | Success Rate | 97 | 4 |
| Memory-Poisoning Attack | τ2-Bench | Attack Hit Rate (AHR) | 90.5 | 3 |
| Agentic Task Completion | τ2-bench Telecom | Success Rate | 100 | 3 |
| Task Resolution | τ2-bench (test) | Success Rate (Airline) | 38.3 | 2 |