
ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs

About

Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge .
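To make the (question, golden context, answer) triple and the rule-based layer of validation more concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Triple` class, the `rule_based_check` function, and the three rules are hypothetical and are not taken from the ToolForge code base, whose actual checks may differ. The model-based layer is not shown.

```python
from dataclasses import dataclass


@dataclass
class Triple:
    """Hypothetical container for one (question, golden context, answer) seed."""
    question: str
    golden_context: list[str]  # passages assumed sufficient to answer the question
    answer: str


def rule_based_check(triple: Triple) -> list[str]:
    """Return a list of rule violations; an empty list means the seed passes.

    The rules below are illustrative assumptions, not the paper's actual checks.
    """
    problems = []
    if not triple.question.strip() or not triple.answer.strip():
        problems.append("empty question or answer")
    if len(triple.golden_context) < 2:
        problems.append("fewer than two passages, so not multi-hop")
    if not any(triple.answer.lower() in p.lower() for p in triple.golden_context):
        problems.append("answer not found in golden context")
    return problems


good = Triple(
    question="Who directed the film that won Best Picture in 1998?",
    golden_context=[
        "Titanic won Best Picture at the 1998 Academy Awards.",
        "Titanic was directed by James Cameron.",
    ],
    answer="James Cameron",
)
bad = Triple(question="Capital of France?", golden_context=["One passage."], answer="Paris")

print(rule_based_check(good))
print(rule_based_check(bad))
```

Cheap deterministic checks like these would plausibly run first, so that only seeds passing them are sent on to a more expensive model-based assessment.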

Hao Chen, Zhexin Hu, Jiajun Chai, Haocheng Yang, Hang He, Xiaohan Wang, Wei Lin, Luhang Wang, Guojun Yin, Zhuofeng Zhao • 2025

Related benchmarks

Task                          Dataset    Metric            Result  Rank
General Question Answering    NQ         Exact Match (EM)  30.3    52
Multi-hop Question Answering  MuSiQue    Score             37.8    16
Multi-hop Question Answering  Bamboogle  Score             48      16
General Question Answering    TriviaQA   Score             16.8    16
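For readers unfamiliar with the Exact Match (EM) metric reported above for NQ, the following sketch shows one common way it is computed. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace); the exact normalization behind the numbers on this page is not stated, so treat the function names and rules as illustrative.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their gold answers."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this definition a partially correct answer such as "Eiffel" for the gold answer "Eiffel Tower" scores zero, which is why EM values tend to be lower than token-overlap metrics on the same predictions.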
