TInR: Exploring Tool-Internalized Reasoning in Large Language Models

About

Tool-Integrated Reasoning (TIR) has emerged as a promising direction that extends the capabilities of Large Language Models (LLMs) with external tools during reasoning. Existing TIR methods typically rely on external tool documentation at reasoning time. However, this reliance leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), which aims to facilitate reasoning with tool knowledge internalized into the LLM itself. Achieving this goal imposes two notable requirements: tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations; and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method in both in-domain and out-of-domain settings. Experimental results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

Qiancheng Xu, Yongqi Li, Fan Liu, Hongru Wang, Min Yang, Wenjie Li • 2026

Related benchmarks

Task	Dataset	Result
Tool Calling	In-domain (seen)	EM 74.05	10
Tool Calling	In-domain unseen	Exact Match (EM) 57.24	10
Tool Calling	BFCL out-of-domain	Exact Match (EM) 26	10
Tool Identification	In-domain (seen)	Exact Match (EM) 85.95	9
Tool Identification	In-domain unseen	EM 75.86	9
Tool Identification	BFCL out-of-domain	Exact Match 38.06	9
Tool Identification	BFCL multi-turn category (test)	Accuracy 34.48	4
Tool Calling	Tool Use Evaluation (test)	Exact Match (EM) 61.31	3
Tool Identification	Tool Use Evaluation (test)	EM Accuracy 78.3	3
Tool Calling	ToolACE multi-turn (test)	Accuracy 61.64	2
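
Most rows above report Exact Match (EM), but this page does not define the metric. The following is a minimal sketch of one common reading, assuming a predicted tool call counts as correct only when both the tool name and every argument equal the reference. The function names and the `name`/`arguments` dictionary layout are hypothetical and are not taken from the paper.

```python
from typing import Any


def tool_call_exact_match(pred: dict[str, Any], ref: dict[str, Any]) -> bool:
    """Return True when tool name and arguments both match exactly.

    Assumption: each call is a dict with "name" and "arguments" keys.
    This layout is illustrative, not the paper's format.
    """
    return (
        pred.get("name") == ref.get("name")
        and pred.get("arguments") == ref.get("arguments")
    )


def exact_match_score(preds: list[dict[str, Any]], refs: list[dict[str, Any]]) -> float:
    """Percentage of predictions that exactly match their reference call."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        return 0.0
    hits = sum(tool_call_exact_match(p, r) for p, r in zip(preds, refs))
    return 100.0 * hits / len(refs)


# Hypothetical example data: the second prediction has a wrong argument value.
refs = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "convert_currency", "arguments": {"amount": 10, "to": "EUR"}},
]
preds = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "convert_currency", "arguments": {"amount": 10, "to": "USD"}},
]
score = exact_match_score(preds, refs)
```

Under this reading, Tool Identification could be scored the same way by comparing only the `name` field, which would be consistent with its EM values being higher than the Tool Calling values on the same datasets.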

