EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

About

LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, and thus prone to entangling behaviors, or single-aspect, and thus blind to cross-module error propagation. To address these limitations, we propose EvoTool, a self-evolving framework that optimizes a modular tool-use policy via a gradient-free evolutionary paradigm. EvoTool decomposes an agent's tool-use policy into four modules (Planner, Selector, Caller, and Synthesizer) and iteratively improves them in a self-improving loop through three novel mechanisms. Trajectory-Grounded Blame Attribution uses diagnostic traces to localize failures to a specific module. Feedback-Guided Targeted Mutation then edits only that module via natural-language critique. Diversity-Aware Population Selection preserves complementary candidates to ensure solution diversity. Across four benchmarks, EvoTool outperforms strong baselines by over 5 points on both GPT-4.1 and Qwen3-8B, while achieving superior efficiency and transferability. The code will be released once the paper is accepted.
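The loop described in the abstract (blame a module, mutate only that module, keep a diverse population) can be illustrated with a toy sketch. This is not the authors' implementation, which has not been released: the names `Policy`, `run_task`, `attribute_blame`, `mutate`, `select`, and `evolve` are hypothetical, the rollout is a deterministic stand-in for a real agent run, the mutation appends a marker instead of applying an LLM critique, and the greedy coverage rule used for selection is an assumption chosen only to show what "complementary candidates" could mean.

```python
from dataclasses import dataclass

MODULES = ("planner", "selector", "caller", "synthesizer")


@dataclass(frozen=True)
class Policy:
    # One natural-language instruction per module, aligned with MODULES.
    prompts: tuple


def run_task(policy, task):
    """Toy rollout (hypothetical): a task is the set of modules that must
    have been fixed for it to succeed. Returns (success, per-module trace)."""
    trace = {m: (m not in task) or ("fixed" in p)
             for m, p in zip(MODULES, policy.prompts)}
    return all(trace.values()), trace


def attribute_blame(trace):
    # Blame the first module, in pipeline order, whose step failed.
    for m in MODULES:
        if not trace[m]:
            return m
    return None


def mutate(policy, module):
    # Edit only the blamed module. A real system would rewrite the prompt
    # from a natural-language critique; here we just append a marker.
    prompts = list(policy.prompts)
    i = MODULES.index(module)
    prompts[i] = prompts[i] + " [fixed]"
    return Policy(tuple(prompts))


def scores(policy, tasks):
    return tuple(int(run_task(policy, t)[0]) for t in tasks)


def select(population, tasks, k):
    # Assumed diversity rule: greedily keep the candidate that solves the
    # most tasks not yet covered by already-kept candidates.
    table = {p: scores(p, tasks) for p in population}
    pool, chosen, covered = list(population), [], set()
    while pool and len(chosen) < k:
        best = max(pool, key=lambda p: (
            sum(1 for i, s in enumerate(table[p]) if s and i not in covered),
            sum(table[p]),
        ))
        chosen.append(best)
        covered |= {i for i, s in enumerate(table[best]) if s}
        pool.remove(best)
    return chosen


def evolve(seed, tasks, generations=8, k=3):
    population = [seed]
    for _ in range(generations):
        children = []
        for parent in population:
            for task in tasks:
                ok, trace = run_task(parent, task)
                if not ok:
                    # One targeted edit per parent per generation.
                    children.append(mutate(parent, attribute_blame(trace)))
                    break
        unique = list(dict.fromkeys(population + children))
        population = select(unique, tasks, k)
    return max(population, key=lambda p: sum(scores(p, tasks)))


tasks = [{"planner"}, {"caller"}, {"selector", "synthesizer"}]
seed = Policy(("plan", "select", "call", "answer"))
best = evolve(seed, tasks)
print(scores(best, tasks), best.prompts)
```

In this toy setting each generation repairs at most one module per parent, so the third task, which depends on two modules, is only solved after several rounds of blame and mutation. No gradients are involved at any point, which matches the gradient-free framing of the abstract.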

Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, Eduard Hovy • 2026

Related benchmarks

Task	Dataset	Metric	Result
Tool Learning	RestBench TMDB	Success Rate	86.2	50
LLM Agent Evaluation	Tau-bench retail	Pass@1	64.8	38
Function Calling	BFCL Multi-turn	Accuracy	42.3	22
Sequential Tool Use	RestBench Spotify	Success Rate	86.1	22
Stateful Agent-User Interaction	Tau-bench airline	Pass@1	39.1	22
Tool-use API Generalization	ToolBench G1 v1	Pass Rate	83.5	22
Tool-use API Generalization	ToolBench G2	Pass Rate	78.2	22
Tool-use API Generalization	ToolBench (G3)	Pass Rate	71.5	22
Function Calling	BFCL Single-Turn	Accuracy	83.9	22
Sequential portfolio allocation	D full (frozen test)	Sharpe Ratio	1.37	14
