| Dataset Name | SOTA Method | Metric | Value | Trend | |
|---|---|---|---|---|---|
| ToolSandbox (test) | H-EPM | Average Task Reward | 0.704 | 27 | 4d ago |
| τ2-BENCH (test) | H-EPM | Average Task Reward | 0.921 | 27 | 4d ago |
| τ-BENCH (test) | H-EPM | Average Task Reward | 0.791 | 27 | 4d ago |
| ToolSandbox | GPT-5.1 with H-EPM | Average Task Reward | 0.67 | 2 | 4d ago |
| τ²-Bench | GPT-5.1 with H-EPM | Average Task Reward | 92.1 | 2 | 4d ago |