| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Tool Retrieval | ToolBench | NDCG@10 58.54 | 44 | |
| Tool Retrieval | ToolBench In-domain I1 | NDCG@1 93.76 | 29 | |
| Tool-use | ToolBench | Average Pass Rate 71.3 | 29 | |
| Tool Reasoning | ToolBench (G3) | Pass Rate 91.8 | 24 | |
| Tool Reasoning | ToolBench G2 | Pass Rate 93 | 24 | |
| Tool Reasoning | ToolBench (G1) | Pass Rate 85.5 | 24 | |
| Tool Retrieval | ToolBench In-domain (I3) | NDCG@1 91.74 | 20 | |
| Tool Retrieval | ToolBench In-domain (I2) | NDCG@1 91.91 | 20 | |
| Tool Use | ToolBench | Energy (Wh) 5.6 | 18 | |
| Throughput Efficiency | ToolBench | Throughput (tokens/s) 4,602 | 18 | |
| Tool-use | ToolBench | Average Token Length 127 | 18 | |
| LLM Inference | ToolBench | Goodput (req/s) 3.9 | 18 | |
| End-to-end Tool-use | ToolBench I1 v1 | SoPR 56.13 | 16 | |
| Function Calling | ToolBench Average | Pass Rate 60.3 | 14 | |
| Function Calling | ToolBench I3-Inst | Pass Rate 52.4 | 14 | |
| Function Calling | ToolBench I2-Inst | Pass Rate 71.4 | 14 | |
| Function Calling | ToolBench I1-Inst | Pass Rate 57.1 | 14 | |
| Tool Use | ToolBench 50 APIs v1 (test) | Wellformedness 99.2 | 14 | |
| Tool Retrieval | ToolBench I3 (test) | Recall@3 76.63 | 13 | |
| Tool Retrieval | ToolBench I2 (test) | Recall@3 75.72 | 13 | |
| Tool planning | ToolBench G1 set | Win Rate (G1-Instruction) 88.192 | 13 | |
| Tool-use Planning | ToolBench Average over all sets | Win Rate 86.54 | 13 | |
| Tool-use Planning | ToolBench G3-Instruction | Win Rate 0.9368 | 13 | |
| Tool-use Planning | ToolBench G2-Category | Win Rate 78.78 | 13 | |
| Tool-use Planning | ToolBench G2-Instruction | Win Rate 87.59 | 13 |