SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

About

Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
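
The void-turn filter described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trajectory` container, the function names, and the two detection rules (a markdown-fenced code block, and a `\boxed{...}` marker for the final answer) are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass

# Markdown code fence, built indirectly so the literal does not appear in this example.
FENCE = "`" * 3

# Assumed markers (hypothetical): a fenced code block and a \boxed{...} final answer.
CODE_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
FINAL_ANSWER = re.compile(r"\\boxed\{")


@dataclass
class Trajectory:
    """Hypothetical container for one multi-turn rollout."""
    turns: list[str]  # model responses, one per turn
    reward: float


def is_void_turn(turn: str) -> bool:
    """A turn is void if it has neither a code block nor a final answer."""
    return not (CODE_BLOCK.search(turn) or FINAL_ANSWER.search(turn))


def filter_void_trajectories(batch: list[Trajectory]) -> list[Trajectory]:
    """Keep only trajectories in which no turn is void."""
    return [t for t in batch if not any(is_void_turn(turn) for turn in t.turns)]
```

In a training loop, a filter of this kind would run on each sampled batch before the policy loss is computed, so that trajectories containing a void turn contribute no gradient.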

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	--	895
Single-hop Question Answering	PopQA	--	186
Single-hop Question Answering	TriviaQA	--	133
General Reasoning	General Reasoning Suite Average	Pass@1: 38.6	63
Mathematical Reasoning	AIME24 (test)	Pass@1 Score: 49.2	61
Tool Use	ToolBench	Average Pass Rate: 49.2	53
Travel Planning	TravelPlanner	Average Tokens Used: 16.2	46
Knowledge-intensive reasoning	MuSiQue	F1 Score: 17.8	43
Knowledge-intensive reasoning	HotpotQA	F1 Score: 0.307	41
Mathematical Reasoning	Mathematical Reasoning Evaluation Suite (AIME24, AIME25, MATH500, AMC23, HMMT25, Olympiad)	AIME 2024 Score: 58.33	33

Showing 10 of 53 rows
