
AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning

About

Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibiting superior generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.

Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, Li Du • 2025
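
The hybrid reward described in the abstract combines three signals: answer correctness, structured output adherence, and a penalty for incorrect tool usage. The following is a minimal sketch of how such a reward could be composed. It is not taken from the AutoTIR codebase: the `Rollout` fields, the function name `hybrid_reward`, the weights, the exact-match answer check, and the idea of passing an `expected_tool` per task are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    """One parsed model response. All field names are hypothetical."""
    answer: str                # final answer extracted from the response
    well_formed: bool          # True if the output followed the required structure
    tool_used: Optional[str]   # name of the invoked tool, or None if no tool was called


def hybrid_reward(
    rollout: Rollout,
    gold_answer: str,
    expected_tool: Optional[str],
    w_answer: float = 1.0,      # assumed weight, not from the paper
    w_format: float = 0.2,      # assumed weight, not from the paper
    tool_penalty: float = 0.5,  # assumed penalty, not from the paper
) -> float:
    """Sum a correctness term and a format term, minus a tool-misuse penalty."""
    reward = 0.0

    # Answer correctness: case-insensitive exact match (an assumed metric).
    if rollout.answer.strip().lower() == gold_answer.strip().lower():
        reward += w_answer

    # Structured output adherence.
    if rollout.well_formed:
        reward += w_format

    # Penalize a wrong tool, an unnecessary tool, or a missing tool call.
    # expected_tool=None means the task should be solved without any tool.
    if rollout.tool_used != expected_tool:
        reward -= tool_penalty

    return reward


# Usage: a correct, well-formed answer that used the expected tool.
good = Rollout(answer="42", well_formed=True, tool_used="code")
print(hybrid_reward(good, gold_answer="42", expected_tool="code"))
```

Treating "no tool" as a valid expected choice is what lets a single penalty term cover both over-use and under-use of tools in this sketch. The paper's actual reward shaping may differ.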

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | MATH | Accuracy 59 | 535
Mathematical Reasoning | AMC 23 | Accuracy 35 | 198
Mathematical Reasoning | AIME24 | Accuracy 80 | 130
Mathematical Reasoning | GSM8K | -- | 102
Mathematical Reasoning | AIME 24 | AIME 24 Accuracy 6.67 | 84
Knowledge-intensive reasoning | MuSiQue | Accuracy 85 | 31
Mathematical Reasoning | MATH | -- | 24
Knowledge-intensive reasoning | HLE | Avg Score 85 | 23
Knowledge-intensive reasoning | HQA | Average Score 85 | 18
Knowledge-intensive reasoning | 2WikiMultihopQA | Accuracy 25 | 18
Showing 10 of 16 rows
