Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
About
The performance of LLM-based agents depends not only on the agent itself but also on the quality of the tool interfaces it consumes. While prior work has focused heavily on agent fine-tuning, tool interfaces, including natural language descriptions and parameter schemas, remain largely human-oriented and often become a bottleneck, especially when agents must select from large candidate tool sets. Existing approaches to improving tool interfaces rely on execution traces, which are frequently unavailable in cold-start or privacy-constrained settings, and typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to abstract reusable interface-usage patterns and tool usage outcomes. To support this approach, we construct a large-scale dataset of high-quality tool interfaces using a structured workflow over a diverse collection of tools. Experiments on StableToolBench and RestBench show consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100, demonstrating that tool interface optimization is a practical and deployable complement to agent fine-tuning.
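To make the idea of moving from trace-rich to trace-free supervision concrete, the following is a minimal illustrative sketch, not the Trace-Free+ implementation. It assumes a linear schedule and a simple per-example masking rule, neither of which is stated in the abstract; the names `ToolInterface`, `trace_keep_probability`, and `build_stage_inputs` are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToolInterface:
    """A tool interface as seen by the agent (hypothetical structure)."""
    name: str
    description: str                                  # natural language description
    parameters: dict = field(default_factory=dict)    # parameter schema


def trace_keep_probability(stage: int, num_stages: int) -> float:
    """Probability of showing execution traces at a curriculum stage.

    Assumed linear decay: 1.0 at the first stage (trace-rich),
    0.0 at the last stage (trace-free).
    """
    if num_stages < 2:
        return 0.0
    return 1.0 - stage / (num_stages - 1)


def build_stage_inputs(examples, stage, num_stages, rng):
    """Hide each example's traces with a stage-dependent probability.

    `examples` is a list of (ToolInterface, list_of_traces) pairs.
    Returns the same pairs, with traces replaced by [] when masked.
    """
    keep = trace_keep_probability(stage, num_stages)
    staged = []
    for tool, traces in examples:
        visible = traces if rng.random() < keep else []
        staged.append((tool, visible))
    return staged


# Usage: three stages, from fully trace-rich to fully trace-free.
tool = ToolInterface(
    name="get_weather",
    description="Return the current weather for a city.",
    parameters={"city": {"type": "string"}},
)
examples = [(tool, ["call: get_weather(city='Paris') -> ok"])]
rng = random.Random(0)
for stage in range(3):
    staged = build_stage_inputs(examples, stage, 3, rng)
    print(stage, trace_keep_probability(stage, 3), len(staged[0][1]))
```

Under this assumed schedule, a rewriting model would see traces for every tool early in training and for none at the end, which matches the trace-free condition described for deployment.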
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tool Use | StableToolBench G1 Category | SL | 76.8 | 12 |
| Tool Use | StableToolBench G1 Instruction | SL Score | 75.5 | 6 |
| Tool Use | StableToolBench G2 Category | SL | 71 | 6 |
| Tool Use | StableToolBench G2 Instruction | SL Score | 68.8 | 6 |
| Tool Use | StableToolBench Overall Average | SL (Success Rate) | 70.3 | 6 |
| Tool Use | StableToolBench G3 Instruction | SL Score | 60.7 | 6 |
| Tool Use | StableToolBench v1 (test) | G1 Category SL | 73.8 | 5 |
| Tool Execution | Trace-based setting | Improvement (%) | 14.8 | 4 |
| Tool Selection | Trace-based setting | Improvement | 6.8 | 4 |
| Tool Use | StableToolBench trace-free (test) | F1 Score (Impr Pts) | 6.8 | 4 |