SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
About
Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
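The void-turn filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `is_void_turn` and `filter_void_trajectories` are hypothetical, and the sketch assumes that a code block appears as a triple-backtick fence and that a final answer appears as `\boxed{...}`. The actual markers used in training may differ.

```python
import re
from typing import List

# Assumption: tool calls are written as triple-backtick fenced code blocks.
FENCE = "`" * 3
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)

# Assumption: a final answer is marked with \boxed{...}.
FINAL_ANSWER = re.compile(r"\\boxed\{")


def is_void_turn(turn: str) -> bool:
    """A turn is void if it has neither a complete code block nor a final answer."""
    return CODE_BLOCK.search(turn) is None and FINAL_ANSWER.search(turn) is None


def filter_void_trajectories(trajectories: List[List[str]]) -> List[List[str]]:
    """Keep only trajectories in which no turn is void.

    Each trajectory is a list of model-generated turns (strings).
    Dropped trajectories would simply be excluded from the policy update.
    """
    return [
        traj for traj in trajectories
        if not any(is_void_turn(turn) for turn in traj)
    ]
```

In this sketch the filter works at the trajectory level: a single void turn is enough to drop the whole trajectory, which matches the abstract's description of removing trajectories that contain void turns rather than masking individual turns.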
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Knowledge-intensive reasoning | Knowledge-Intensive Reasoning Suite (2Wiki., Bamb., HQA, MuSi., SimQA) | 2Wiki Score: 16.1 | 25 |
| Computational Reasoning | Computational Reasoning Suite (AIME24, AIME25, AMC23, GSM8K, MATH) | AIME24 Score: 17.5 | 10 |
| Reasoning | 10 challenging reasoning tasks (Combined) | Average Score: 31.7 | 10 |
| Multi-Turn Tool-Integrated Reasoning (TIR) | AIME25 | Peak avg@32 Score: 26.67 | 6 |
| Multi-Turn Tool-Integrated Reasoning (TIR) | AIME24 | Peak avg@32 Score: 37.91 | 6 |
| Multi-Turn Tool-Integrated Reasoning (TIR) | AMC23 | Peak avg@32 Score: 71.25 | 6 |
| Multi-Turn Tool-Integrated Reasoning (TIR) | MATH500 | Peak avg@32 Score: 82.25 | 6 |
| Mathematical Reasoning | MATH 500 | Accuracy: 77 | 2 |
| Mathematical Reasoning | AIME 2024 | Accuracy: 18.2 | 2 |
| Mathematical Reasoning | AIME 2025 | Accuracy: 0.198 | 2 |