| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| ToolBench | Llama 3.3-70B | Average Success Rate (ASR) | 99.61 | 62 | 7d ago |
| ToolBench | Qwen2.5-7B-Instruct-CAST | Average Pass Rate | 80.67 | 53 | 19d ago |
| BFCL | | Accuracy | 94 | 45 | 1d ago |
| τ-Bench | | Average Pass@1 | 67.42 | 38 | 1mo ago |
| RoTBench Multi-turn | PA-Tool | Tool Selection Accuracy | 72.9 | 35 | 1mo ago |
| RoTBench Single-turn | PA-Tool | Tool Selection | 84.8 | 35 | 1mo ago |
| ToolBench (test) | AgentHER-MJ | Pass@1 | 83.7 | 28 | 2mo ago |
| StableToolBench | ReAct+PLAY2PROMPT | I2 Category Success | 72.8 | 28 | 22h ago |
| ToolAlpaca | EMA | Tool Use Success Rate | 77.9 | 26 | 21d ago |
| MCPMark | | Total Success Rate | 54.7 | 26 | 1mo ago |
| tool-use (test) | PBSD | Accuracy | 72 | 24 | 27d ago |
| Synthetic Data (test) | RISE | Task Accuracy | 92.29 | 24 | 3mo ago |
| BFCL Multi-turn | | Accuracy | 54.75 | 24 | 3mo ago |
| RobustBench-TC Perturbed 1.0 (test) | Qwen3-14B | Accuracy (Perturbed) | 52.9 | 21 | 21d ago |
| RobustBench-TC Clean 1.0 (test) | LoopTool-32B | Clean Accuracy | 77.9 | 21 | 21d ago |
| Evaluation dataset | PORTool | Accuracy | 51.98 | 20 | 1mo ago |
| ToolHop | HarnessForge | Answer Correctness | 54.87 | 18 | 23h ago |
| CONFETTI | ToolWeave (O) | Accuracy | 45.45 | 18 | 20d ago |
| API-Bank Level 2 | ToolWeave-R (G) | Accuracy | 66.22 | 18 | 20d ago |
| ToolUse | GRPO | Task Accuracy | 69.2 | 18 | 14d ago |
| ToolBench | | Energy (Wh) | 5.6 | 18 | 3mo ago |
| Tool-use domain Aggregate | Agent-Dice | AvgZ Score | 0.79 | 18 | 3mo ago |
| Tool-use domain Subset 3 | | Functionality Score | 100 | 18 | 3mo ago |
| Tool-use domain Subset 2 | | Func Success Rate | 99.63 | 18 | 3mo ago |
| Tool-use domain Subset 1 | | Func | 99.63 | 18 | 3mo ago |