Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning
About
LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
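The iterative loop described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the TOOLQP implementation: the `TOOLS` library, the word-overlap `retrieve` function (in place of a dense retriever), and the `toy_planner` that splits the instruction on " and " (in place of the trained planner). It only shows the control flow: the planner issues one query per step, sees the retrieval history, and stops when it has no further sub-task.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy tool library: tool name -> documentation string.
TOOLS: Dict[str, str] = {
    "get_weather": "return the weather forecast for a city",
    "search_flights": "search flights between two airports",
    "convert_currency": "convert an amount between currencies",
}

History = List[Tuple[str, List[str]]]  # (query, retrieved tool names) per step


def retrieve(query: str, k: int = 1) -> List[str]:
    """Toy retriever (assumption): rank tools by word overlap with their docs."""
    words = set(query.lower().split())
    ranked = sorted(TOOLS, key=lambda name: -len(words & set(TOOLS[name].split())))
    return ranked[:k]


def toy_planner(instruction: str, history: History) -> Optional[str]:
    """Stand-in planner (assumption): one sub-task per step, split on ' and '.

    Returns the next query, or None when every sub-task has been issued.
    """
    sub_tasks = [part.strip() for part in instruction.split(" and ") if part.strip()]
    return sub_tasks[len(history)] if len(history) < len(sub_tasks) else None


def iterative_retrieve(instruction: str, k: int = 1, max_steps: int = 5) -> List[str]:
    """Query the retriever step by step and collect the distinct tools found."""
    history: History = []
    selected: List[str] = []
    for _ in range(max_steps):
        query = toy_planner(instruction, history)
        if query is None:
            break
        results = retrieve(query, k)
        history.append((query, results))
        for tool in results:
            if tool not in selected:
                selected.append(tool)
    return selected


result = iterative_retrieve("search flights to Tokyo and convert currency to yen")
print(result)
```

With `k=1`, a single-shot call to `retrieve` on the full instruction can return only one tool, whereas the loop issues one query per sub-task and so can recover a tool for each.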
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tool Calling | API-Bank L-1 | -- | -- | 46 |
| Tool Calling | API-Bank L-2 | -- | -- | 25 |
| Tool Retrieval | TOOLRET In-Domain (Avg) | nDCG@10 | 63.1 | 15 |
| Tool Retrieval | TOOLRET Zero-Shot Code | nDCG@10 | 32 | 15 |
| Tool Retrieval | TOOLRET Zero-Shot Custom | nDCG@10 | 45.8 | 15 |
| Tool Retrieval | TOOLRET Zero-Shot Macro-Avg | nDCG@10 | 36.9 | 15 |
| Tool Retrieval | TOOLRET Zero-Shot Web* | nDCG@10 | 33 | 15 |
| Tool Calling | ToolBench generalization dataset (I2-Cat) | -- | -- | 7 |
| Tool Calling | StableToolBench (STB) I3-Inst | Solvable Pass Rate | 48.3 | 6 |
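For readers unfamiliar with the retrieval metric in the table, the following is a minimal sketch of nDCG@10 under two stated assumptions: linear gain (the relevance label is used directly, which matches the exponential-gain variant when labels are binary) and an ideal ranking computed from the labels of the ranked list itself, so every relevant tool is assumed to appear somewhere in that list. The benchmark's own evaluation code may differ, and it appears to report scores on a 0 to 100 scale rather than the 0 to 1 scale used here.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (rank starts at 1)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of retrieved tools, in ranked order (1 = relevant tool).
score = ndcg_at_k([1, 0, 1])
print(round(score, 4))
```

A ranking that places every relevant tool first scores 1.0, and a ranking with no relevant tools scores 0.0 by the convention chosen in this sketch.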