| Dataset Name | SOTA Method | Metric | Value | Results | Updated |
|---|---|---|---|---|---|
| τ-Bench | | Average Pass@1 | 67.42 | 38 | 2d ago |
| ToolBench | Attention Buckets + DFSDT-Retriever | Average Pass Rate | 71.3 | 29 | 1mo ago |
| ToolBench (test) | AgentHER-MJ | Pass@1 | 83.7 | 28 | 25d ago |
| StableToolBench | ReAct+PLAY2PROMPT | I2 Category Success | 72.8 | 28 | 1mo ago |
| Synthetic Data (test) | RISE | Task Accuracy | 92.29 | 24 | 1mo ago |
| BFCL Multi-turn | | Accuracy | 54.75 | 24 | 1mo ago |
| ToolBench | | Energy (Wh) | 5.6 | 18 | 1mo ago |
| ToolBench | Mistral Large | Average Token Length | 127 | 18 | 1mo ago |
| Tool-use domain Aggregate | Agent-Dice | AvgZ Score | 0.79 | 18 | 1mo ago |
| Tool-use domain Subset 3 | | Functionality Score | 100 | 18 | 1mo ago |
| Tool-use domain Subset 2 | | Func Success Rate | 99.63 | 18 | 1mo ago |
| Tool-use domain Subset 1 | | Func | 99.63 | 18 | 1mo ago |
| Tool-use domain Subset 0 | | Func Success Rate | 99.64 | 18 | 1mo ago |
| API-Bank (test) | | Accuracy | 92.6 | 16 | 1mo ago |
| ACEBench Parallel | | Accuracy | 81 | 15 | 4d ago |
| ACEBench Single | | Accuracy | 90 | 15 | 4d ago |
| Tool Use Non-Live | DARE+TA | Para | 0.925 | 15 | 1mo ago |
| Tool Use Live | Base | Para Score | 56.25 | 15 | 1mo ago |
| Causal and Downstream Robustness Ablation Suite Averaged over 4 models | HETA | Tool Hit@1 Δ | 4.1 | 14 | 2d ago |
| Tool use | SDPO (on-policy) | Avg@16 | 68.5 | 14 | 15d ago |
| Tau-Bench | | TAU-AIR Score | 67.5 | 14 | 1mo ago |
| Task-Bench | Reasoningbank | Task Completion Rate | 58.2 | 14 | 1mo ago |
| StableToolBench cost-augmented | INTENT | PR | 76 | 14 | 1mo ago |
| ToolBench 50 APIs v1 (test) | Llama-2-Chat-7B | Wellformedness | 99.2 | 14 | 1mo ago |
| Extended BFCL Single | Claude-Sonnet-4.5 | File Accuracy | 61.54 | 12 | 4d ago |