| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score 81.1 | 84 |
| Task failure prediction and selective task completion | tau2-bench Telecom 1.0 | AUROC 0.809 | 15 |
| Task failure prediction and selective task completion | tau2-bench Retail 1.0 | AUROC 0.707 | 15 |
| Task failure prediction and selective task completion | tau2-bench Airline 1.0 | AUROC 74.2 | 15 |
| Long-Horizon Tool Execution | tau2-Bench | Retail Success Rate 75.2 | 12 |
| Multi-turn agent decision making | tau2-Bench (test) | Success Rate 22.3 | 7 |
| Tool Use | Tau2-Bench | Success Rate 57.4 | 6 |
| Agentic Tool-use | tau2-bench Airline | Pass@1 63.5 | 6 |
| Agentic Tool-use | tau2-bench Retail | Pass@1 82 | 6 |
| Agentic Skill Acquisition | tau2-bench | Pass@1 76.7 | 5 |
| Agent Task Success | tau2-bench Retail Domain | - | 0 |
| Agent Task Success | tau2-bench Airline Domain | - | 0 |