| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Function Calling | BFCL V3 | Overall Accuracy | 79.3 | 104 |
| Function Calling | BFCL Multi-Turn v3 | Overall Accuracy | 78.7 | 69 |
| Function Calling | BFCL | False Negative Rate | 0 | 56 |
| Tool Use | BFCL | Accuracy | 94 | 45 |
| Tool-use Factuality Evaluation | BFCL Task | Factuality Score | 76 | 42 |
| Function Calling | BFCL Individual Tools per Problem | Execution Accuracy | 95 | 30 |
| Function Calling | BFCL | Success Rate (Simple) | 83.27 | 29 |
| Function Calling | BFCL v4 | Score | 68.8 | 25 |
| Function Calling | BFCL (Live) | Simple Accuracy | 88.25 | 24 |
| Multi-Turn Function Calling | BFCL Multi-Turn Base v3 | Greedy Success Rate | 69 | 24 |
| Tool-use | BFCL Multi-turn | Accuracy | 54.75 | 24 |
| Tool-use Inference | BFCL v2 | MAT Score | 5.31 | 22 |
| Function Calling | BFCL Multi-turn | Accuracy | 42.3 | 22 |
| Function Calling | BFCL Single-turn | Accuracy | 84.2 | 22 |
| Function Calling / Tool Use | BFCL parallel parallel-multiple Actions | Accuracy | 82.2 | 20 |
| Function Calling | BFCL Memory | Task Accuracy | 28.22 | 20 |
| Function Calling | BFCL V4 | Multi-Turn Success Rate | 62.3 | 20 |
| Tool Usage | BFCL Multi-Parallel v2 | Accuracy | 87.5 | 20 |
| Tool Usage | BFCL Parallel v2 | Accuracy | 87.5 | 20 |
| Tool Usage | BFCL Multi-Parallel v1 | Accuracy | 90.5 | 20 |
| Tool Usage | BFCL Parallel v1 | Accuracy | 95.5 | 20 |
| Function Calling | BFCL Simple Python | Accuracy | 0.938 | 20 |
| Tool-use agentic performance | BFCL V3 | Avg@4 | 79.5 | 19 |
| Execution Accuracy | BFCL v2 | Non-Live AST Accuracy | 88.24 | 18 |
| Tool-calling | BFCL Extended Setting | Non-Live Score | 85.81 | 18 |