
Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning

About

Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to solve complex tasks by interacting with external tools, yet existing approaches depend on high-quality synthesized trajectories selected by scoring functions and on sparse outcome-based rewards, which provide limited and biased supervision for learning TIR. To address these challenges, we propose AutoTraj, a two-stage framework that automatically learns TIR by repairing and rewarding tool-use trajectories. Specifically, in the supervised fine-tuning (SFT) stage, AutoTraj generates multiple candidate tool-use trajectories for each query and evaluates them along multiple dimensions. High-quality trajectories are directly retained, while low-quality ones are repaired using an LLM (i.e., LLM-as-Repairer). The resulting repaired and high-quality trajectories form a synthetic SFT dataset, while each repaired trajectory paired with its original low-quality counterpart constitutes a dataset for trajectory preference modeling. In the reinforcement learning (RL) stage, based on the preference dataset, we train a trajectory-level reward model to assess the quality of reasoning paths and combine it with outcome and format rewards, thereby explicitly guiding the optimization toward reliable TIR behaviors. Experiments on real-world benchmarks demonstrate the effectiveness of AutoTraj in TIR.
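
The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Trajectory` class, the `generate`, `score`, and `repair` callables, the single scalar `threshold`, and the weighted-sum form of `combined_reward` are all hypothetical names and assumptions introduced here for illustration. The abstract only states that trajectories are evaluated along multiple dimensions and that the trajectory-level reward is combined with outcome and format rewards, without saying how.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Hypothetical container: a query plus its reasoning and tool-call steps."""
    query: str
    steps: List[str]


def build_datasets(
    queries: List[str],
    generate: Callable[[str, int], List[Trajectory]],  # assumed candidate sampler
    score: Callable[[Trajectory], float],              # assumed quality score in [0, 1]
    repair: Callable[[Trajectory], Trajectory],        # assumed LLM-as-Repairer wrapper
    num_candidates: int = 4,
    threshold: float = 0.5,                            # assumed quality cut-off
) -> Tuple[List[Trajectory], List[Tuple[Trajectory, Trajectory]]]:
    """Return (sft_data, preference_data) following the abstract's description."""
    sft_data: List[Trajectory] = []
    preference_data: List[Tuple[Trajectory, Trajectory]] = []
    for query in queries:
        for traj in generate(query, num_candidates):
            if score(traj) >= threshold:
                # High-quality trajectories are retained directly.
                sft_data.append(traj)
            else:
                # Low-quality trajectories are repaired; the repaired version
                # joins the SFT set and forms a (preferred, rejected) pair.
                fixed = repair(traj)
                sft_data.append(fixed)
                preference_data.append((fixed, traj))
    return sft_data, preference_data


def combined_reward(
    trajectory_reward: float,
    outcome_reward: float,
    format_reward: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed weights
) -> float:
    """Weighted sum of the three reward signals (the sum form is an assumption)."""
    w_t, w_o, w_f = weights
    return w_t * trajectory_reward + w_o * outcome_reward + w_f * format_reward
```

In this sketch the preference pairs would feed the trajectory-level reward model, whose output is then passed to `combined_reward` as `trajectory_reward` during the RL stage.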

Siyu Gong, Linan Yue, Weibo Gao, Fangzhou Yao, Shimin Di, Lei Feng, Min-Ling Zhang • 2026

Related benchmarks

Task                          | Dataset         | Result                 | Rank
------------------------------|-----------------|------------------------|-----
Mathematical Reasoning        | MATH            | Accuracy 69.1          | 535
Mathematical Reasoning        | AMC 23          | Accuracy 47.5          | 198
Mathematical Reasoning        | AIME24          | Accuracy 93            | 130
Mathematical Reasoning        | GSM8K           | --                     | 102
Mathematical Reasoning        | AIME 24         | AIME 24 Accuracy 23.33 | 84
Knowledge-intensive reasoning | MuSiQue         | Accuracy 86            | 31
Mathematical Reasoning        | MATH            | --                     | 24
Knowledge-intensive reasoning | HLE             | Avg Score 85           | 23
Knowledge-intensive reasoning | HQA             | Average Score 87       | 18
Knowledge-intensive reasoning | 2WikiMultihopQA | Accuracy 29.5          | 18
Showing 10 of 15 rows
