Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

About

Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data, and demonstrated an average performance gain of ~20% over corresponding base models, and delivers state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.

Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig• 2025

Related benchmarks

Task	Dataset	Result
General AI Assistant Tasks	GAIA	Accuracy9.1	291
Web navigation	WebArena	--	138
Software Engineering	SWE-bench Verified	Accuracy40.3	43
Agentic Task Completion	τ2-bench	Airline Success Rate28	19
Tool Use Evaluation	ToolSandbox	Similarity0.422	19
Operating System Control	AgentBench OS	Accuracy34.7	15
Web task automation	WebArena	Accuracy22.2	2

Showing 7 of 7 rows

Other info

Follow for update

@wizwand_team Discord