| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Function Calling | BFCL V3 | Overall Accuracy 79.3 | 104 |
| Function Calling | BFCL Multi-Turn v3 | Overall Accuracy 68.38 | 41 |
| Function Calling | BFCL Individual Tools per Problem | Execution Accuracy 95 | 30 |
| Tool-use | BFCL Multi-turn | Accuracy 54.75 | 24 |
| Tool-use Inference | BFCL v2 | MAT Score 5.31 | 22 |
| Function Calling | BFCL Multi-turn | Accuracy 42.3 | 22 |
| Function Calling | BFCL Single-turn | Accuracy 84.2 | 22 |
| Function Calling | BFCL Simple Python | Accuracy 0.938 | 20 |
| Tool-use agentic performance | BFCL V3 | Avg@4 79.5 | 19 |
| Tool-calling | BFCL Extended Setting | Non-Live Score 85.81 | 18 |
| Tool-calling | BFCL Standard Setting | Non-Live Accuracy 86.46 | 18 |
| Tool-Use Agent Evaluation | BFCL Multiturn (OOD) v3 (test) | Base Rate 48 | 18 |
| Function Calling | BFCL | Energy (Wh) 4.2 | 18 |
| Throughput Efficiency | BFCL | Throughput 5,093 | 18 |
| Tool-calling | BFCL | Non-Live Success Rate 90.65 | 17 |
| Function Calling | BFCL Multi-Turn v4 (test) | Overall Acc 46.75 | 17 |
| Multi-Turn Tool Calling | BFCL v4 (val) | Overall Accuracy 85 | 15 |
| Function Calling | BFCL | Accuracy 77.9 | 14 |
| Tool-Augmented Planning | BFCL v3 | Live Success Rate 84.1 | 14 |
| Tool-augmented reasoning | BFCL Multi-Turn v3 | Overall Score 69.1 | 14 |
| Function Calling | BFCL (Held-In) | Accuracy 89.4 | 14 |
| Function Calling | BFCL v4 | Score 68.8 | 13 |
| Tool calling | BFCL Multiple | Accuracy 92.5 | 12 |
| Function Calling | BFCL Exec v3 | Overall Accuracy 94.6 | 12 |
| Function Calling | BFCL Live v3 | Overall Accuracy 77.9 | 12 |