| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score: 81.1 | 102 |
| Agentic Tool-use | tau2-bench Airline | Pass@1: 63.5 | 22 |
| Agentic Tool-use | tau2-bench Retail | Pass@1: 82 | 22 |
| Task failure prediction and selective task completion | tau2-bench Telecom 1.0 | AUROC: 0.809 | 15 |
| Task failure prediction and selective task completion | tau2-bench Retail 1.0 | AUROC: 0.707 | 15 |
| Task failure prediction and selective task completion | tau2-bench Airline 1.0 | AUROC: 74.2 | 15 |
| Long-Horizon Tool Execution | tau2-Bench | Retail Success Rate: 75.2 | 12 |
| Agentic Skill Acquisition | tau2-bench | Pass@1: 81.2 | 9 |
| Multi-turn agent decision making | tau2-Bench (test) | Success Rate: 22.3 | 7 |
| General Task (Agentic Coding) | tau2-Bench Telecom | Score: 98.2 | 6 |
| Tool Use | Tau2-Bench | Success Rate: 57.4 | 6 |
| Agentic Tool Use | tau2-Bench | Accuracy: 64 | 2 |
| Agent Task Success | tau2-bench Retail Domain | Metric: - | 0 |
| Agent Task Success | tau2-bench Airline Domain | Metric: - | 0 |