
ToolACE: Winning the Points of LLM Function Calling

About

Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE.

Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yuxian Wang, Wu Ning, Yutai Hou, Bin Wang, Chuhan Wu, Xinzhi Wang, Yong Liu, Yasheng Wang, Duyu Tang, Dandan Tu, Lifeng Shang, Xin Jiang, Ruiming Tang, Defu Lian, Qun Liu, Enhong Chen • 2024
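
The abstract mentions a dual-layer verification system that combines rule-based and model-based checks. As a rough illustration of what a rule-based layer could look like, the sketch below validates a generated function call against an API definition. It is not the ToolACE implementation: the `WEATHER_API` definition, the `rule_check` helper, and the schema layout are all hypothetical, and the model-based layer is not shown.

```python
from typing import Any

# Hypothetical API definition in a JSON-Schema-like shape (an assumption,
# not a format taken from the paper).
WEATHER_API = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": "string", "required": True},
        "days": {"type": "integer", "required": False},
    },
}

# Map schema type names to the Python types accepted for them.
_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def rule_check(call: dict[str, Any], api: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means the call passes."""
    if call.get("name") != api["name"]:
        return [f"unknown function: {call.get('name')!r}"]

    errors = []
    params = api["parameters"]
    args = call.get("arguments", {})

    # Rule 1: every required parameter must be present.
    for name, spec in params.items():
        if spec.get("required") and name not in args:
            errors.append(f"missing required argument: {name}")

    # Rule 2: no undeclared arguments, and each value must match its declared type.
    for name, value in args.items():
        if name not in params:
            errors.append(f"unexpected argument: {name}")
            continue
        type_name = params[name]["type"]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if is_stray_bool or not isinstance(value, _PY_TYPES[type_name]):
            errors.append(f"wrong type for {name}: expected {type_name}")
    return errors
```

In a pipeline of this kind, a sample that fails such cheap structural rules could be discarded before any more expensive model-based check is run; whether ToolACE orders its checks this way is not stated in the abstract.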

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score: 38.7 | 84 |
| Agent Performance | Tau-Bench | Retail Accuracy: 37.4 | 55 |
| Agent Performance | ACEBench Agent | Agent Score: 52 | 36 |
| Tool Use | BFCL Multi-turn | Accuracy: 37 | 24 |
| Function Calling | BFCL Multi-Turn v4 (test) | Overall Acc: 37 | 17 |
| Pathological Multimodal Understanding | PathMMU (test) | ACS: 40.5 | 13 |
| Function Calling | Berkeley Function Calling Leaderboard (BFCL) Live and Non-live | Non-live AST Score: 87.5 | 11 |
| Tool Use | BFCL Single-Turn | OA: 82.54 | 10 |
| Function Calling | BFCL non-live (test) | AST Accuracy (Simple Python): 78.3 | 4 |
| Function Calling | APIGen (test) | Score (Single): 89.1 | 2 |
