| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| ToolSandbox (test) | H-EPM | Avg Task Reward | 0.704 | 27 | 3mo ago |
| τ2-BENCH (test) | H-EPM | Average Task Reward | 0.921 | 27 | 3mo ago |
| τ-BENCH (test) | H-EPM | Average Task Reward | 0.791 | 27 | 3mo ago |
| ClawGym-Bench | | Product & Collaboration Score | 76 | 17 | 1mo ago |
| PinchBench | | Pass@1 | 88.7 | 17 | 1mo ago |
| tau2-bench, SkillsBench, and ALFWorld Average | SkillsInjector | Average Pass Rate | 58.7 | 9 | 5d ago |
| SkillsBench | SkillsInjector | Pass Rate | 22.6 | 9 | 5d ago |
| tau2-bench telecom | SkillsInjector | Pass Rate | 67 | 9 | 5d ago |
| tau2-bench airline | SkillsInjector | Pass Rate | 60 | 9 | 5d ago |
| AppWorld | PREPING | TGC Success Rate (N) | 83.7 | 7 | 19d ago |
| τ-bench-retail | SkillMAS | Success Rate | 70.2 | 5 | 22d ago |
| τ-bench airline | FAMA | Pass@1 | 36.8 | 3 | 1mo ago |
| ToolSandbox | GPT-5.1 with H-EPM | Average Task Reward | 0.67 | 2 | 3mo ago |
| τ²-Bench | GPT-5.1 with H-EPM | Avg Task Reward | 92.1 | 2 | 3mo ago |