| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Tool Use | StableToolBench | I2 Category Success 72.8 | 28 |
| Next-state prediction | StableToolBench (STB) | EM Accuracy 49.25 | 16 |
| Tool Use | StableToolBench cost-augmented | PR 76 | 14 |
| Agent Tool Use | StableToolBench Held-In | Pass Rate 50.4 | 14 |
| Tool Learning | StableToolBench Average | SoPR 70.3 | 13 |
| Tool Learning | StableToolBench I3-Inst. | SoPR 76 | 13 |
| Tool Learning | StableToolBench I2-Cat. | SoPR 71.9 | 13 |
| Tool Learning | StableToolBench I2-Inst. | SoPR 73.4 | 13 |
| Tool Learning | StableToolBench I1-Cat. | SoPR 70.9 | 13 |
| Tool Learning | StableToolBench I1-Tool | SoPR 73.9 | 13 |
| Tool Learning | StableToolBench I1-Inst. | SoPR 69 | 13 |
| Tool Use | StableToolBench G1 Category | SL 76.8 | 12 |
| Tool orchestration | StableToolBench 1.0 (test) | I1 Instruction Success Rate 50.3 | 10 |
| API Execution Simulation | StableToolBench | ID High Success Rate 16.47 | 8 |
| Tool Use | StableToolBench Overall Average | SL (Success Rate) 70.3 | 6 |
| Tool Use | StableToolBench G3 Instruction | SL Score 66.3 | 6 |
| Tool Use | StableToolBench G2 Instruction | SL Score 68.8 | 6 |
| Tool Use | StableToolBench G2 Category | SL 71 | 6 |
| Tool Use | StableToolBench G1 Instruction | SL Score 75.5 | 6 |
| Tool calling | StableToolBench (STB) I3-Inst | Solvable Pass Rate 48.3 | 6 |
| Tool Use | StableToolBench v1 (test) | G1 Category SL 75.5 | 5 |
| Tool Use | StableToolBench trace-free (test) | F1 Score (Impr Pts) 6.8 | 4 |