Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use

About

While most efforts to improve LLM-based tool-using agents focus on the agent itself (through larger models, better prompting, or fine-tuning), agent performance increasingly plateaus due to the quality of the tool interfaces these agents consume. Tool descriptions are often written for human developers and tolerate ambiguity that agents cannot resolve, particularly as the number of candidate tools grows. Existing approaches to improving tool interfaces (1) require re-running a multi-stage per-tool pipeline (synthesizing queries, executing an agent to collect trajectories, annotating trajectories, and prompting a strong LLM multiple times) for every API that enters the catalog, and (2) typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to internalize reusable patterns of what makes a tool description effective. To support this approach, we construct a large-scale dataset of high-quality tool interfaces derived from real-world APIs through a principled data synthesis workflow. Experiments on widely adopted benchmarks show that Trace-Free+ improves robustness as tool catalogs scale to 150+ candidates: in scaling experiments, it reduces accuracy degradation by 29.23% and improves average query-level success by 60.89% on StableToolBench. It also generalizes across domains without retraining and provides complementary gains on top of agent fine-tuning.
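To make the idea of moving from trace-rich to trace-free supervision concrete, here is a minimal sketch of one way such a curriculum could be scheduled. The abstract does not specify the schedule, so the linear decay, the prompt layout, and the names `trace_keep_probability` and `build_rewriter_input` are all illustrative assumptions, not the authors' implementation.

```python
import random
from typing import List


def trace_keep_probability(stage: int, num_stages: int) -> float:
    """Probability of showing agent traces at a curriculum stage.

    Assumption: linear decay from 1.0 (first stage, trace-rich)
    to 0.0 (last stage, trace-free). The source does not state the schedule.
    """
    if num_stages < 2:
        raise ValueError("need at least two stages")
    if not 0 <= stage < num_stages:
        raise ValueError("stage out of range")
    return 1.0 - stage / (num_stages - 1)


def build_rewriter_input(
    description: str,
    traces: List[str],
    keep_prob: float,
    rng: random.Random,
) -> str:
    """Assemble a hypothetical training input for a description rewriter.

    Traces are included with probability keep_prob, so later stages
    increasingly train on the bare description alone.
    """
    parts = [f"Original description: {description}"]
    if traces and rng.random() < keep_prob:
        parts.append("Agent traces:")
        parts.extend(f"- {t}" for t in traces)
    parts.append("Rewrite the description so an agent can use the tool reliably.")
    return "\n".join(parts)


if __name__ == "__main__":
    rng = random.Random(0)
    for stage in range(4):
        p = trace_keep_probability(stage, 4)
        text = build_rewriter_input(
            "Get weather", ["call failed: missing 'city'"], p, rng
        )
        print(f"stage {stage}: keep_prob={p:.2f}, has_traces={'Agent traces:' in text}")
```

Under this assumed schedule, the first stage always sees traces and the last stage never does, which matches the trace-free condition described for deployment.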

Ruocheng Guo, Kaiwen Dong, Xiang Gao, Kamalika Das • 2026

Related benchmarks

Task	Dataset	Result
Tool Use	StableToolBench G1 Category	SL 76.8	12
Tool Use	StableToolBench G1 Instruction	SL Score 75.5	6
Tool Use	StableToolBench G2 Category	SL 71	6
Tool Use	StableToolBench G2 Instruction	SL Score 68.8	6
Tool Use	StableToolBench Overall Average	SL (Success Rate) 70.3	6
Tool Use	StableToolBench G3 Instruction	SL Score 60.7	6
Tool Use	StableToolBench v1 (test)	G1 Category SL 73.8	5
Tool Execution	Trace-based setting	Improvement (%) 14.8	4
Tool Selection	Trace-based setting	Improvement 6.8	4
Tool Use	StableToolBench trace-free (test)	F1 Score (Impr Pts) 6.8	4
