DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

About

Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands coverage of diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts synthesis order, executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces, thereby providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, and an Evidence Collection–Task Derivation loop further induces rich multi-step tool-use patterns across 373 tools in five domains. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68. Remarkably, controlled scaling analysis reveals that diversity scaling consistently outperforms quantity scaling for OOD generalization, even with 4× less data.
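The inverted synthesis order described above (execute tools first, then derive a task from the resulting traces) can be illustrated with a toy sketch. This is not the authors' implementation: the three string tools, the names `TOOL_POOL`, `collect_evidence`, `derive_task`, and `verify`, and the chained-call pattern are all hypothetical stand-ins chosen only to show why a task derived from executed traces is executable and verifiable by construction.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """One executed tool call: which tool, on what input, with what output."""
    tool: str
    argument: str
    output: str


@dataclass
class Task:
    """A task derived from traces; the answer is taken from real tool output."""
    question: str
    answer: str
    toolset: list[str]
    evidence: list[Trace]


# Hypothetical toy tool pool; real tools would be search, finance, or medical APIs.
TOOL_POOL: dict[str, Callable[[str], str]] = {
    "upper": lambda s: s.upper(),
    "reverse": lambda s: s[::-1],
    "length": lambda s: str(len(s)),
}


def sample_toolset(rng: random.Random, k: int) -> list[str]:
    """Per-task toolset variety: draw k distinct tools from the pool."""
    return rng.sample(sorted(TOOL_POOL), k)


def collect_evidence(toolset: list[str], seed_input: str) -> list[Trace]:
    """Evidence collection: run the tools first, chaining each output forward."""
    traces, value = [], seed_input
    for name in toolset:
        output = TOOL_POOL[name](value)
        traces.append(Trace(name, value, output))
        value = output
    return traces


def derive_task(toolset: list[str], traces: list[Trace]) -> Task:
    """Task derivation: write a question whose answer is entailed by the traces."""
    steps = ", then ".join(t.tool for t in traces)
    question = f"Starting from {traces[0].argument!r}, apply {steps}. What is the result?"
    return Task(question, traces[-1].output, toolset, traces)


def verify(task: Task) -> bool:
    """Replay the recorded calls and check they reproduce the stored answer."""
    replayed = collect_evidence(task.toolset, task.evidence[0].argument)
    return replayed[-1].output == task.answer


def synthesize(n_tasks: int, k: int, seed: int = 0) -> list[Task]:
    """Generate n_tasks tasks, each grounded in a freshly sampled toolset."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(n_tasks):
        toolset = sample_toolset(rng, k)
        tasks.append(derive_task(toolset, collect_evidence(toolset, "dive")))
    return tasks
```

In this sketch, growing `TOOL_POOL` corresponds to the tool-pool coverage axis and varying `k` or the sampled combination corresponds to per-task toolset variety; every generated task passes `verify` because its answer was read off an actual execution rather than written beforehand.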

Aili Chen, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao • 2026

Related benchmarks

Task	Dataset	Result
Medical Agent Task Execution	MedAgentBench	Success Rate 57.3	24
Domain Deep Research Tool Use	FinSearchComp Global-T2	Success Rate 67.3	12
Domain Deep Research Tool Use	FinSearchComp Global-T3	Success Rate 37.3	12
In-distribution Tool Use	DIVE-Eval	Success Rate 42.5	12
Financial Specialist Tool Use	Finance Agent Benchmark	Success Rate 34	12
General Deep Research Tool Use	GAIA	Success Rate 61.2	12
General Deep Research Tool Use	Browsecomp	Success Rate 16.4	12
General Deep Research Tool Use	xbench DeepSearch	Success Rate 58.1	12
General Deep Research Tool Use	HLE	Success Rate 17.8	12
Zero-Shot Generalist Tool Use	Toolathlon	Success Rate 8.3	12

