
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

About

Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
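The void-turn filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` type, the function names, and the markers used to detect a code block (a triple-backtick fence) or a final answer (`\boxed{`) are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed markers (hypothetical, not given in the abstract):
# a code block is a triple-backtick fence, a final answer uses \boxed{...}.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n.*?" + FENCE, re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{")


@dataclass
class Trajectory:
    """Hypothetical container for one multi-turn rollout."""
    turns: List[str]  # text generated by the model at each turn
    reward: float     # scalar outcome reward for the whole rollout


def is_void_turn(turn: str) -> bool:
    """A turn is void if it yields neither a code block nor a final answer."""
    return CODE_BLOCK.search(turn) is None and FINAL_ANSWER.search(turn) is None


def has_void_turn(trajectory: Trajectory) -> bool:
    """True if any turn in the rollout is void."""
    return any(is_void_turn(turn) for turn in trajectory.turns)


def filter_void_trajectories(batch: List[Trajectory]) -> List[Trajectory]:
    """Keep only rollouts without void turns, to be used in the policy update."""
    return [traj for traj in batch if not has_void_turn(traj)]
```

In this sketch the filter runs on a sampled batch before the loss is computed, so a rollout with any void turn contributes no gradient. Dropping the whole trajectory, rather than masking only the void turn, follows the abstract's wording of removing "problematic trajectories from the policy update".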

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | -- | 895 |
| Single-hop Question Answering | PopQA | -- | 186 |
| Single-hop Question Answering | TriviaQA | -- | 133 |
| General Reasoning | General Reasoning Suite Average | Pass@1: 38.6 | 63 |
| Mathematical Reasoning | AIME24 (test) | Pass@1 Score: 49.2 | 61 |
| Tool Use | ToolBench | Average Pass Rate: 49.2 | 53 |
| Travel Planning | TravelPlanner | Average Tokens Used: 16.2 | 46 |
| Knowledge-intensive reasoning | MuSiQue | F1 Score: 17.8 | 43 |
| Knowledge-intensive reasoning | HotpotQA | F1 Score: 0.307 | 41 |
| Mathematical Reasoning | Mathematical Reasoning Evaluation Suite (AIME24, AIME25, MATH500, AMC23, Hmmt25, Olympiad) | AIME 2024 Score: 58.33 | 33 |
Showing 10 of 53 rows
