| Dataset Name | SOTA Method | Metric | | | Trend |
|---|---|---|---|---|---|
| tau-bench | | Retail Accuracy 78.3 | 55 | 3mo ago | |
| ACEBench Agent | AgentSkiller | Agent Score 78 | 36 | 3mo ago | |
| VitaBench OTA | NoisyAgent | Avg@4 9.75 | 10 | 7d ago | |
| VitaBench In-Store | NoisyAgent | Avg@4 32.25 | 10 | 7d ago | |
| VitaBench Delivery | NoisyAgent | Avg@4 29 | 10 | 7d ago | |
| Tau-bench Telecom | NoisyAgent | Avg@4 Score 45.39 | 10 | 7d ago | |
| Tau-bench Retail | NoisyAgent | Avg@4 60.31 | 10 | 7d ago | |
| AgentNoiseBench-Vita Noisy setting 1.0 (test) | NoisyAgent | Delivery Avg@4 Score 28.75 | 10 | 7d ago | |
| AgentNoiseBench Noisy setting tau2 1.0 (test) | NoisyAgent | Retail Avg@4 Score 43.2 | 10 | 7d ago | |
| HELD-OUT Suite | GPT-4 | HotpotQA Score 52.1 | 7 | 3mo ago | |
| WindowsAgentArena (test) | CoAct-1 | Office Score 30.4 | 6 | 3mo ago | |
| AgentInstruct HELD-IN | GPT-4 | HELD-IN 2.75 | 6 | 3mo ago | |