APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
About
Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. In the second phase, these blueprints are transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models, the xLAM-2-fc-r series, with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on $\tau$-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source 5K synthetic data trajectories and the trained xLAM-2-fc-r models to advance research in AI agents. Models are available at https://huggingface.co/collections/Salesforce/xlam-2-67ef5be12949d8dcdae354c4, the dataset at https://huggingface.co/datasets/Salesforce/APIGen-MT-5k, and the project website at https://apigen-mt.github.io
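The two-phase flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`Blueprint`, `build_blueprint`, `simulate_trajectory`, and the toy proposer, reviewers, and agent) is hypothetical, the majority-vote approval rule is an assumption, and so is the rule that a trajectory is kept only when the agent's actions match the blueprint's ground-truth actions. The LLM calls are replaced by plain functions so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Blueprint:
    instruction: str      # what the simulated user wants
    actions: List[str]    # ground-truth actions that solve the task


# A reviewer returns None to approve, or a feedback string to reject (assumed interface).
Reviewer = Callable[[Blueprint], Optional[str]]


def committee_review(bp: Blueprint, reviewers: List[Reviewer]) -> Tuple[bool, List[str]]:
    """Collect feedback and approve when fewer than half of the reviewers object."""
    complaints = [note for note in (r(bp) for r in reviewers) if note is not None]
    return len(complaints) * 2 < len(reviewers), complaints


def build_blueprint(propose, reviewers, max_rounds: int = 3) -> Optional[Blueprint]:
    """Phase 1 (sketch): propose, review, and retry with accumulated feedback."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        bp = propose(feedback)
        approved, complaints = committee_review(bp, reviewers)
        if approved:
            return bp
        feedback = feedback + complaints
    return None  # no blueprint passed review within the round budget


def simulate_trajectory(bp: Blueprint, agent):
    """Phase 2 (sketch): keep the dialogue only if the agent's actions match ground truth."""
    turns, actions = agent(bp.instruction)
    return turns if actions == bp.actions else None


# Toy stand-ins for the LLM components (assumptions for illustration only).
def toy_propose(feedback: List[str]) -> Blueprint:
    # Omits the final action until a reviewer has complained about it.
    actions = ["find_order", "cancel_order"] if feedback else ["find_order"]
    return Blueprint("Cancel my latest order.", actions)


def needs_cancel(bp: Blueprint) -> Optional[str]:
    return None if "cancel_order" in bp.actions else "missing cancel_order"


def non_empty(bp: Blueprint) -> Optional[str]:
    return None if bp.actions else "no actions"


def toy_agent(instruction: str):
    turns = [("user", instruction), ("assistant", "Your order is cancelled.")]
    return turns, ["find_order", "cancel_order"]
```

With these stand-ins, `build_blueprint(toy_propose, [needs_cancel, needs_cancel, non_empty])` is rejected in the first round and approved in the second, after which `simulate_trajectory(bp, toy_agent)` returns the dialogue turns because the agent's actions match the blueprint.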
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Function Calling | BFCL V3 | Overall Accuracy | 78.4 | 88 |
| Interactive Tool-Use Agent Performance | tau2-Bench | Retail Performance Score | 61.4 | 84 |
| Agent Performance | Tau-Bench | Retail Accuracy | 67.1 | 55 |
| Agent Performance | ACEBench Agent | Agent Score | 38.4 | 36 |
| Interactive Tool-Use Agent Performance | VitaBench | Cross Score | 4 | 30 |
| Tool Use | BFCL Multi-turn | Accuracy | 27.25 | 24 |
| Tool-augmented Reasoning | BFCL Multi-Turn v3 | Overall Score | 69.1 | 14 |
| Tool Use Reasoning | ∞Bench | Avg Accuracy | 48.3 | 14 |
| Tool Use | Tau-Bench | TAU-AIR Score | 33 | 14 |
| Tool Use | τ²-Bench (out-of-distribution) | Retail Score | 54.9 | 8 |