ToolACE: Winning the Points of LLM Function Calling
About
Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.
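The rule-based half of the dual-layer verification described above can be illustrated with a minimal sketch. This is not ToolACE's actual verifier: the API definition format, the `rule_check` function, the `API_POOL` example, and the specific rules (known function name, required arguments present, no unexpected arguments, matching types) are all assumptions chosen for illustration. The model-based layer is omitted because it would require an LLM judge.

```python
from typing import Any

# Hypothetical mapping from schema type names to Python types (an assumption).
_TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}

# Hypothetical one-entry API pool, used only for illustration.
API_POOL = {
    "get_weather": {
        "parameters": {
            "city": {"type": "string"},
            "days": {"type": "integer"},
        },
        "required": ["city"],
    }
}


def rule_check(call: dict[str, Any], api_pool: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means the call passes."""
    name = call.get("name")
    if name not in api_pool:
        return [f"unknown function: {name!r}"]

    spec = api_pool[name]
    params = spec.get("parameters", {})
    args = call.get("arguments", {})
    errors = []

    # Rule 1: every required argument must be supplied.
    for required in spec.get("required", []):
        if required not in args:
            errors.append(f"missing required argument: {required}")

    # Rule 2: no unexpected arguments. Rule 3: values match declared types.
    for key, value in args.items():
        if key not in params:
            errors.append(f"unexpected argument: {key}")
            continue
        expected = _TYPE_MAP.get(params[key].get("type"))
        if expected is None:
            continue  # unknown type name in the schema: skip the type check
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        if isinstance(value, bool) and expected is not bool:
            errors.append(f"wrong type for argument: {key}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for argument: {key}")

    return errors
```

A generated sample whose call returns an empty list from `rule_check` would then go on to the model-based check, while any sample with violations could be discarded or regenerated.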
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Function Calling | BFCL V3 | -- | -- | 104 |
| Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score | 38.7 | 102 |
| Agentic Tool-use | tau2-Bench | Retail Score | 0.00 | 59 |
| Agent Performance | Tau-Bench | Retail Accuracy | 37.4 | 55 |
| Agent Performance | ACEBench Agent | Agent Score | 52 | 36 |
| Agentic Capability Evaluation | ACEBench-en | Normal Score | 28.3 | 34 |
| Function Calling | BFCL Live | Simple Accuracy | 82.95 | 24 |
| Tool Use | BFCL Multi-turn | Accuracy | 37 | 24 |
| Function Calling | BFCL Multi-Turn v4 (test) | Overall Acc | 37 | 17 |
| Multi-turn tool-use | BFCL Multi-Turn v3 | Average Success Rate | 38.5 | 17 |