Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees

About

Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model's intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
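To make the idea of crediting sub-trajectories in a rollout tree concrete, here is a minimal sketch. It is not DART's actual estimator, which the abstract does not specify. It assumes each leaf carries a binary outcome reward, that a node's value is the mean reward of the leaves below it, and that a branch's advantage is its value minus its parent's value. The `Node` class, the function names, and the toy tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One segment of a rollout; only leaves carry a final outcome reward."""
    name: str
    uses_tool: bool = False
    reward: Optional[float] = None  # assumed: 1.0 = correct answer, 0.0 = wrong
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf below (or at) `node`."""
    if not node.children:
        return [float(node.reward or 0.0)]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Assumed value estimate: mean outcome reward over the node's leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def branch_advantages(node: Node, out: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Assumed advantage of each branch: value(child) - value(parent)."""
    if out is None:
        out = {}
    parent_value = value(node)
    for child in node.children:
        out[child.name] = value(child) - parent_value
        branch_advantages(child, out)
    return out


# Toy tree: at one position the rollout either keeps reasoning in text
# or branches into a tool call; each branch is then sampled twice.
root = Node("root", children=[
    Node("text_branch", children=[
        Node("text_a", reward=0.0),
        Node("text_b", reward=1.0),
    ]),
    Node("tool_branch", uses_tool=True, children=[
        Node("tool_a", reward=1.0),
        Node("tool_b", reward=1.0),
    ]),
])

adv = branch_advantages(root)
print(adv["tool_branch"], adv["text_branch"])
```

In this toy tree the tool branch receives a positive advantage and the text-only branch a negative one, so a policy-gradient update weighted by these values would reinforce the tool call at that position. A real system would need a tree-construction policy and a way to choose branching positions, neither of which is modelled here.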

Kun Li, Zenan Xu, Junan Li, Zengrui Jin, Jinghao Deng, Zexuan Qiu, Bo Zhou • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Expert-Level Question Answering | GPQA Diamond | Pass@1: 66.65 | 39 |
| Mathematical Reasoning | AIME24 | Pass@1: 73.47 | 18 |
| Mathematical Reasoning | AIME 25 | Pass@1: 0.6556 | 18 |
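The results above are reported as Pass@1. The listing does not say how the metric was computed or how many samples were drawn per problem, and it does not state the scale of each entry (0.6556 may be a fraction where the other two values are percentages). As a reference point only, the commonly used unbiased pass@k estimator can be sketched as follows; the sample counts in the example are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 16 samples for one problem, 8 of them correct.
single_problem = pass_at_k(n=16, c=8, k=1)
print(single_problem)
```

A benchmark score is then the mean of this per-problem estimate over all problems, optionally multiplied by 100 to express it as a percentage.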
