Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

About

Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited to tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories can mitigate catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss can maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at the 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance among open-source models across a wide range of benchmarks, including 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.
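Point (iii) relies on pass@k as a checkpoint-selection signal. The abstract does not say how pass@k is computed, so the following is only a minimal sketch that assumes the widely used unbiased estimator: draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and its arguments are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute pass@k differently.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset contains a hit.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with 8 samples per problem and 2 correct, `pass_at_k(8, 2, 1)` gives 0.25, while `pass_at_k(8, 2, 4)` is noticeably higher. The gap between pass@1 and pass@k is one rough indicator of how much exploration headroom a checkpoint leaves for a later RL stage.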

Qianjia Cheng, Yuchen Zhang, Zhilin Wang, Yuxin Zuo, Shunkai Zhang, Yuchen Fan, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng, Yun Luo, Ganqu Cui • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	AIME 2025	Accuracy: 99.2	378
Mathematical Reasoning	HMMT 2025	Accuracy: 92.5	241
Mathematical Reasoning	HMMT25	Accuracy (%): 92.5	115
Code Generation	LiveCodeBench	Accuracy: 73.2	64
Mathematical Reasoning	IMO-Answer-Bench	Accuracy: 80.3	32
Mathematical Reasoning	BeyondAIME	Accuracy: 82.5	18
Knowledge Reasoning	GPQA Diamond	Accuracy (avg@8): 75.4	16
Mathematical Reasoning	APEX 2025	Accuracy: 16.7	14
Scientific Computation Reasoning	FrontierScience	Accuracy: 53	4

