ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

About

Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge .

Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, Zhuofeng zhao• 2025

Related benchmarks

Task	Dataset	Result
General Question Answering	NQ	Exact Match (EM)30.3	52
Multi-hop Question Answering	MuSiQue	Score37.8	16
Multi-hop Question Answering	Bamboogle	Score48	16
General Question Answering	TriviaQA	Score16.8	16

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord