
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

About

Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. In the second phase, these blueprints are transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models, the xLAM-2-fc-r series, with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source 5K synthetic data trajectories and the trained xLAM-2-fc-r models to advance research in AI agents. Models: https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4; dataset: https://huggingface.co/datasets/Salesforce/APIGen-MT-5k; website: https://apigen-mt.github.io
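
The two-phase flow described above can be illustrated with a small sketch. This is not the authors' implementation: every name here (Blueprint, generate_blueprint, simulate_trajectory, keep_trajectory, run_demo) is hypothetical, the LLM proposer, reviewers, agent, and simulated user are replaced by deterministic stubs, and both the majority-vote approval rule and the exact-match check against ground-truth actions are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Blueprint:
    """Hypothetical task blueprint: an instruction plus its ground-truth actions."""
    instruction: str
    ground_truth_actions: List[str]


# A reviewer returns None to approve, or a feedback string to reject.
Reviewer = Callable[[Blueprint], Optional[str]]


def generate_blueprint(propose, reviewers: List[Reviewer], max_rounds: int = 3):
    """Phase 1 (sketch): propose, collect committee feedback, retry with it."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        bp = propose(feedback)
        feedback = [c for r in reviewers if (c := r(bp)) is not None]
        # Assumed rule: accept when a strict majority of reviewers approve.
        if len(feedback) * 2 < len(reviewers):
            return bp
    return None


def simulate_trajectory(bp: Blueprint, agent_step, user_step, max_turns: int = 8):
    """Phase 2 (sketch): alternate agent and simulated-user turns."""
    messages: List[Tuple[str, str]] = [("user", bp.instruction)]
    actions: List[str] = []
    for _ in range(max_turns):
        reply, action = agent_step(messages)
        messages.append(("agent", reply))
        if action is not None:
            actions.append(action)
        user_msg = user_step(messages)
        if user_msg is None:  # simulated user ends the conversation
            break
        messages.append(("user", user_msg))
    return messages, actions


def keep_trajectory(bp: Blueprint, actions: List[str]) -> bool:
    """Assumed filter: keep only trajectories whose actions match the blueprint."""
    return actions == bp.ground_truth_actions


def run_demo():
    """Run both phases with deterministic stand-ins for the LLM components."""
    def propose(feedback):
        actions = ["lookup_order"]
        if feedback:  # revise the blueprint once reviewers complain
            actions.append("cancel_order")
        return Blueprint("Cancel my order #1", actions)

    def needs_cancel(bp):
        ok = "cancel_order" in bp.ground_truth_actions
        return None if ok else "missing cancel action"

    def always_ok(bp):
        return None

    bp = generate_blueprint(propose, [needs_cancel, needs_cancel, always_ok])
    script = ["lookup_order", "cancel_order"]

    def agent_turns(messages):
        return sum(1 for role, _ in messages if role == "agent")

    def agent_step(messages):
        n = agent_turns(messages)
        action = script[n] if n < len(script) else None
        return (f"calling {action}" if action else "done"), action

    def user_step(messages):
        return "ok, continue" if agent_turns(messages) < len(script) else None

    messages, actions = simulate_trajectory(bp, agent_step, user_step)
    return bp, messages, actions, keep_trajectory(bp, actions)
```

In this toy run the first proposal is rejected by two of the three stub reviewers, the revised proposal is accepted, and the simulated conversation is kept because its action sequence equals the blueprint's ground-truth actions. A real pipeline would replace each stub with a model call and would likely use richer checks than exact equality.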

Akshara Prabhakar, Zuxin Liu, Ming Zhu, Jianguo Zhang, Tulika Awalgaonkar, Shiyu Wang, Zhiwei Liu, Haolin Chen, Thai Hoang, Juan Carlos Niebles, Shelby Heinecke, Weiran Yao, Huan Wang, Silvio Savarese, Caiming Xiong • 2025

Related benchmarks

Task | Dataset | Result | Rank
Function Calling | BFCL V3 | Overall Accuracy: 78.4 | 104
Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score: 61.4 | 102
Agent Performance | Tau-Bench | Retail Accuracy: 67.1 | 55
Auto-bidding | AuctionNet-Sparse | Score: 26.9 | 52
Tool Use | ∞Bench | Average Pass@1: 50.52 | 38
Agent Performance | ACEBench Agent | Agent Score: 38.4 | 36
Interactive Tool-Use Agent Performance | VitaBench | Cross Score: 4 | 30
Tool Use | BFCL Multi-turn | Accuracy: 27.25 | 24
Tool-Use Agent Evaluation | BFCL Multiturn (OOD) v3 (test) | Base Rate: 37.5 | 18
Tool Calling | DIABENCH Static Evaluation 1.0 | Accuracy: 48 | 17
Showing 10 of 29 rows
