SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

About

Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
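
The void-turn filter described above could look roughly like the following. This is a minimal sketch, not the authors' implementation: the `<code>` and `<answer>` markers, the `Trajectory` container, and all function names are assumptions made for illustration, since the abstract does not specify how code blocks and final answers are delimited.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical delimiters: the abstract does not say how code blocks or
# final answers are marked, so these tags are assumptions.
CODE_BLOCK = re.compile(r"<code>.*?</code>", re.DOTALL)
FINAL_ANSWER = re.compile(r"<answer>.*?</answer>", re.DOTALL)


@dataclass
class Trajectory:
    """Hypothetical container for one multi-turn rollout."""
    turns: List[str]  # model-generated text of each turn
    reward: float     # scalar outcome reward for the whole rollout


def is_void_turn(turn: str) -> bool:
    """A turn is void if it yields neither a code block nor a final answer."""
    return CODE_BLOCK.search(turn) is None and FINAL_ANSWER.search(turn) is None


def has_void_turn(trajectory: Trajectory) -> bool:
    """True if any turn in the rollout is void."""
    return any(is_void_turn(turn) for turn in trajectory.turns)


def filter_void_trajectories(batch: List[Trajectory]) -> List[Trajectory]:
    """Keep only rollouts without void turns before the policy update."""
    return [traj for traj in batch if not has_void_turn(traj)]
```

In this sketch the filter runs on the sampled batch before the loss is computed, so a trajectory with a void turn contributes no gradient at all. Whether the original method drops such trajectories outright or masks them inside the loss is not stated in the abstract.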

Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An• 2025

Related benchmarks

Task | Dataset | Result | Rank
Knowledge-intensive reasoning | Knowledge-Intensive Reasoning Suite (2Wiki., Bamb., HQA, MuSi., SimQA) | 2Wiki Score: 16.1 | 25
Computational Reasoning | Computational Reasoning Suite (AIME24, AIME25, AMC23, GSM8K, MATH) | AIME24 Score: 17.5 | 10
Reasoning | 10 challenging reasoning tasks (Combined) | Average Score: 31.7 | 10
Multi-Turn Tool-Integrated Reasoning (TIR) | AIME25 | Peak avg@32 Score: 26.67 | 6
Multi-Turn Tool-Integrated Reasoning (TIR) | AIME24 | Peak avg@32 Score: 37.91 | 6
Multi-Turn Tool-Integrated Reasoning (TIR) | AMC23 | Peak avg@32 Score: 71.25 | 6
Multi-Turn Tool-Integrated Reasoning (TIR) | MATH500 | Peak avg@32 Score: 82.25 | 6
Mathematical Reasoning | MATH 500 | Accuracy: 77 | 2
Mathematical Reasoning | AIME 2024 | Accuracy: 18.2 | 2
Mathematical Reasoning | AIME 2025 | Accuracy: 0.198 | 2