DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

About

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation that strategic planning and self-correction require. Reinforcement learning (RL) mitigates this, but conventional RL approaches for TIR are hindered by sparse outcome-based rewards, which fail to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. First, we introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Second, we devise Process-Supervised Reinforcement Learning based on GRPO, which uses an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments show that DeepTool substantially improves Qwen2.5-7B across six benchmarks (e.g., AIME24 from 3.2% to 40.4% and HMMT25 from 0.0% to 28.6%). A token cost-effectiveness analysis further confirms the utility of interleaved thinking, showing that DeepTool balances performance and token efficiency.

Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu • 2026
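
The abstract contrasts sparse outcome rewards with per-turn process supervision on top of GRPO. The sketch below illustrates that general idea only; it is not the paper's method. The function names, the `process_weight` parameter, and the rule of averaging per-turn scores are all assumptions made for illustration, since the abstract does not define the Action-Centric Process Reward. The group normalization follows the commonly described GRPO form (reward minus group mean, divided by group standard deviation); the choice of population standard deviation here is also an assumption.

```python
from statistics import mean, pstdev


def blended_reward(outcome, turn_scores, process_weight=0.5):
    """Hypothetical reward: sparse outcome plus weighted mean of per-turn scores.

    outcome      -- 1.0 if the final answer is correct, else 0.0
    turn_scores  -- one score in [0, 1] per tool-calling turn (assumed format)
    """
    process = mean(turn_scores) if turn_scores else 0.0
    return outcome + process_weight * process


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled trajectories for one prompt: (outcome, per-turn scores).
group = [
    (1.0, [1.0, 1.0]),  # correct answer, every tool call well formed
    (1.0, [0.0, 1.0]),  # correct answer, one bad tool call
    (0.0, [1.0, 0.0]),  # wrong answer, one good tool call
    (0.0, []),          # wrong answer, no tool calls
]
rewards = [blended_reward(o, t) for o, t in group]
advantages = group_relative_advantages(rewards)
```

With an outcome-only reward, the first two trajectories would receive identical advantages; adding a per-turn term separates them, which is the kind of intermediate signal the abstract argues sparse rewards fail to provide.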

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | -- | 236 |
| Mathematical Reasoning | AIME 2024 | Mean Score (k=8): 40.4 | 81 |
| Mathematical Reasoning | AMC 23 | Avg@8: 75.3 | 60 |
| Mathematical Reasoning | HMMT Feb 2025 | -- | 45 |
| Mathematical Reasoning | AIME 2025 | Average@8 Score: 35 | 15 |
| Mathematical Reasoning | OlympiadBench | Average@8 Score: 49.8 | 10 |
| Mathematical Reasoning | GPQA Diamond | Average@8 Score: 45.3 | 10 |
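
Several results above are reported as Avg@8 or a mean score with k=8. A common reading of such metrics is the average accuracy over 8 sampled attempts per problem, averaged across problems. The sketch below assumes that definition; the listing itself does not state how the scores were computed, and `avg_at_k` is a hypothetical helper name.

```python
def avg_at_k(correct, k=8):
    """Mean accuracy (in percent) over k sampled attempts per problem.

    correct -- list with one entry per problem; each entry is a list of
               k booleans marking whether each sampled attempt was right.
    """
    if not correct:
        raise ValueError("need at least one problem")
    for attempts in correct:
        if len(attempts) != k:
            raise ValueError(f"expected {k} attempts, got {len(attempts)}")
    per_problem = [sum(attempts) / k for attempts in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Two problems: one solved on every attempt, one solved on half of them.
score = avg_at_k([[True] * 8, [True, False] * 4])
```

Averaging over several samples reduces the variance that a single sampled attempt would have on small benchmarks such as AIME, which contain only a few dozen problems.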