
Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use

About

Most efforts to improve LLM-based tool-using agents focus on the agent itself, through larger models, better prompting, or fine-tuning. Yet agent performance increasingly plateaus because of the quality of the tool interfaces these agents consume. Tool descriptions are often written for human developers and tolerate ambiguity that agents cannot resolve, particularly as the number of candidate tools grows. Existing approaches to improving tool interfaces (1) require re-running a multi-stage per-tool pipeline (synthesizing queries, executing an agent to collect trajectories, annotating trajectories, and prompting a strong LLM multiple times) for every API that enters the catalog, and (2) typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to internalize reusable patterns of what makes a tool description effective. To support this approach, we construct a large-scale dataset of high-quality tool interfaces derived from real-world APIs through a principled data synthesis workflow. Experiments on widely adopted benchmarks show that Trace-Free+ improves robustness as tool catalogs scale to 150+ candidates: in scaling experiments, it reduces accuracy degradation by 29.23% and improves average query-level success by 60.89% on StableToolBench. It also generalizes across domains without retraining and provides complementary gains on top of agent fine-tuning.
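The idea of progressively moving from trace-rich to trace-free supervision can be pictured with a small sketch. This is an illustration only, not the authors' training code: the linear decay, the function names `trace_fraction` and `build_stage_batch`, and the example record fields (`description`, `trace`, `target`) are all assumptions introduced here, since the abstract does not specify the schedule or the data format.

```python
import random


def trace_fraction(stage: int, num_stages: int) -> float:
    """Fraction of examples that keep their execution trace at a given stage.

    Assumption: a linear decay from 1.0 (first stage) to 0.0 (last stage).
    """
    if num_stages < 2:
        return 0.0
    return 1.0 - stage / (num_stages - 1)


def build_stage_batch(examples, stage, num_stages, rng):
    """Return training inputs for one curriculum stage.

    Each example keeps its trace with probability trace_fraction(stage, ...);
    otherwise the trace is dropped so the model must rewrite the description
    from the original text alone, as it would at trace-free deployment.
    """
    keep = trace_fraction(stage, num_stages)
    batch = []
    for ex in examples:
        with_trace = rng.random() < keep
        batch.append({
            "description": ex["description"],           # original tool description
            "trace": ex["trace"] if with_trace else None,
            "target": ex["target"],                     # improved description
        })
    return batch


# Hypothetical toy data, for illustration only.
examples = [
    {"description": "Gets weather.",
     "trace": "agent called tool without a city argument",
     "target": "Returns current weather for a given city name."},
    {"description": "Search.",
     "trace": "agent picked this tool for an image query",
     "target": "Searches text documents by keyword."},
]

rng = random.Random(0)
first_stage = build_stage_batch(examples, stage=0, num_stages=4, rng=rng)
last_stage = build_stage_batch(examples, stage=3, num_stages=4, rng=rng)
```

Under this assumed schedule, the first stage sees every trace and the last stage sees none, so the final stage matches the trace-free setting in which the rewriter would be used on tools that have just entered the catalog.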

Ruocheng Guo, Kaiwen Dong, Xiang Gao, Kamalika Das • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Tool Use | StableToolBench G1 Category | SL | 76.8 | 12
Tool Use | StableToolBench G1 Instruction | SL Score | 75.5 | 6
Tool Use | StableToolBench G2 Category | SL | 71 | 6
Tool Use | StableToolBench G2 Instruction | SL Score | 68.8 | 6
Tool Use | StableToolBench Overall Average | SL (Success Rate) | 70.3 | 6
Tool Use | StableToolBench G3 Instruction | SL Score | 60.7 | 6
Tool Use | StableToolBench v1 (test) | G1 Category SL | 73.8 | 5
Tool Execution | Trace-based setting | Improvement (%) | 14.8 | 4
Tool selection | Trace-based setting | Improvement | 6.8 | 4
Tool Use | StableToolBench trace-free (test) | F1 Score (Impr Pts) | 6.8 | 4
