
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

About

Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited to tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories can mitigate catastrophic forgetting of text-only reasoning capacity; (iii) selecting SFT checkpoints by pass@k and response length rather than training loss can maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon a suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at the 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance among open-source models across a wide range of benchmarks, such as 96.7% and 99.2% on AIME 2025 for the 4B and 30B models, respectively.
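The abstract names pass@k as a selection signal but does not say how it is computed. As a minimal sketch, assuming the commonly used unbiased estimator (for a problem with n sampled responses of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k)), it could be evaluated as below. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled responses, c: number of correct ones,
    k: sample budget being scored. Assumed estimator, not the paper's code.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is entirely incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this assumption, a checkpoint could be compared against others by calling `mean_pass_at_k` on its per-problem sample counts alongside its average response length, rather than by its training loss.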

Qianjia Cheng, Yuchen Zhang, Zhilin Wang, Yuxin Zuo, Shunkai Zhang, Yuchen Fan, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng, Yun Luo, Ganqu Cui • 2026

Related benchmarks

Task                              Dataset           Result                  Rank
Mathematical Reasoning            AIME 2025         Accuracy: 99.2          214
Mathematical Reasoning            HMMT 2025         Accuracy: 92.5          194
Mathematical Reasoning            HMMT25            Accuracy (%): 92.5      115
Code Generation                   LiveCodeBench     Accuracy: 73.2          64
Mathematical Reasoning            IMO-Answer-Bench  Accuracy: 80.3          32
Mathematical Reasoning            BeyondAIME        Accuracy: 82.5          18
Knowledge Reasoning               GPQA Diamond      Accuracy (avg@8): 75.4  16
Mathematical Reasoning            APEX 2025         Accuracy: 16.7          14
Scientific Computation Reasoning  FrontierScience   Accuracy: 53            4