| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| tau-bench | | Retail Accuracy | 78.3 | 55 | 4d ago |
| ACEBench Agent | AgentSkiller | Agent Score | 78 | 36 | 4d ago |
| HELD-OUT Suite | GPT-4 | HotpotQA Score | 52.1 | 7 | 4d ago |
| WindowsAgentArena (test) | CoAct-1 | Office Score | 30.4 | 6 | 4d ago |
| AgentInstruct HELD-IN | GPT-4 | HELD-IN | 2.75 | 6 | 4d ago |