Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
About
Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited to tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories can mitigate catastrophic forgetting of text-only reasoning capacity; (iii) selecting SFT checkpoints by pass@k and response length rather than training loss can maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon a suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance among open-source models across a wide range of benchmarks, such as 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.
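Point (iii) relies on pass@k as a selection signal. The abstract does not say how pass@k is computed, so the following is only a minimal sketch that assumes the commonly used unbiased estimator: given n sampled responses per problem, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and its parameters are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k); the paper may compute pass@k differently.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k responses are incorrect, every size-k subset
    # contains at least one correct response.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this sketch, a checkpoint that solves each problem in only a fraction of its samples can still score well at larger k, which is one way to read the abstract's point about preserving headroom for RL exploration.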
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2025 | Accuracy | 99.2 | 214 |
| Mathematical Reasoning | HMMT 2025 | Accuracy | 92.5 | 194 |
| Mathematical Reasoning | HMMT25 | Accuracy (%) | 92.5 | 115 |
| Code Generation | LiveCodeBench | Accuracy | 73.2 | 64 |
| Mathematical Reasoning | IMO-Answer-Bench | Accuracy | 80.3 | 32 |
| Mathematical Reasoning | BeyondAIME | Accuracy | 82.5 | 18 |
| Knowledge Reasoning | GPQA Diamond | Accuracy (avg@8) | 75.4 | 16 |
| Mathematical Reasoning | APEX 2025 | Accuracy | 16.7 | 14 |
| Scientific Computation Reasoning | FrontierScience | Accuracy | 53 | 4 |