| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Function Calling | BFCL V3 | Overall Accuracy | 79.3 | 88 |
| Function Calling | BFCL Multi-Turn v3 | Overall Accuracy | 68.38 | 41 |
| Function Calling | BFCL Individual Tools per Problem | Execution Accuracy | 95 | 30 |
| Tool-use | BFCL Multi-turn | Accuracy | 54.75 | 24 |
| Function Calling | BFCL | Energy (Wh) | 4.2 | 18 |
| Throughput Efficiency | BFCL | Throughput | 5,093 | 18 |
| Function Calling | BFCL Multi-Turn v4 (test) | Overall Acc | 46.75 | 17 |
| Multi-Turn Tool Calling | BFCL v4 (val) | Overall Accuracy | 85 | 15 |
| Tool-Augmented Planning | BFCL v3 | Live Success Rate | 84.1 | 14 |
| Tool-augmented reasoning | BFCL Multi-Turn v3 | Overall Score | 69.1 | 14 |
| Function Calling | BFCL (Held-In) | Accuracy | 89.4 | 14 |
| Function Calling | BFCL v4 | Score | 68.8 | 13 |
| Documentation Generation | BFCL Opaque | Semantic Similarity | 78 | 12 |
| Tool Use | BFCL | Accuracy | 66.3 | 12 |
| Function Calling | BFCL Executable (test) | Success Rate (Simple, Python) | 100 | 12 |
| Function Calling | BFCL V3 (test) | Overall Accuracy | 63.34 | 11 |
| Multi-Turn Function Calling | BFCL Multi-Turn Base v3 | Greedy Success Rate | 44.2 | 11 |
| Agent & Alignment | BFCL v3 | Score | 75.61 | 10 |
| Tool-use | BFCL Single-Turn | OA | 84.11 | 10 |
| Function Calling | BFCL v3 2025-08-26 (test) | Multi-Turn Overall Accuracy | 50 | 9 |
| Function calling | BFCL Multi-Turn Base v3 (val) | Avg@8 | 43 | 9 |
| Tool Use | BFCL (test) | Accuracy | 90.2 | 9 |
| Tool Use | BFCL Agentic v4 (out-of-distribution) | Web-base Score | 39 | 8 |
| Function Calling | BFCL Simple Python | Accuracy | 0.923 | 8 |
| Tool Calling | BFCL V3 | pass@1 | 70.4 | 7 |