
EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

About

LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, which makes them prone to entangling behaviors, or single-aspect, which leads them to ignore cross-module error propagation. To address these limitations, we propose EvoTool, a self-evolving framework that optimizes a modular tool-use policy via a gradient-free evolutionary paradigm. EvoTool decomposes the agent's tool-use policy into four modules (Planner, Selector, Caller, and Synthesizer) and iteratively improves them in a self-improving loop through three novel mechanisms. Trajectory-Grounded Blame Attribution uses diagnostic traces to localize failures to a specific module. Feedback-Guided Targeted Mutation then edits only that module via natural-language critique. Diversity-Aware Population Selection preserves complementary candidates to ensure solution diversity. Across four benchmarks, EvoTool outperforms strong baselines by over 5 points on both GPT-4.1 and Qwen3-8B, while achieving superior efficiency and transferability. The code will be released once the paper is accepted.
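
The abstract gives no implementation details, so the following is only a toy sketch of how the three mechanisms might fit together in one loop. It is not the authors' code. Every name here (`rollout`, `blame`, `mutate`, `select`, `evolve`) is hypothetical, and the keyword-matching "agent" and string-appending "critique" are stand-ins for the LLM calls a real system would make.

```python
# Toy sketch of a blame -> targeted mutation -> diverse selection loop.
# All functions and the scoring rule are illustrative assumptions.

MODULES = ("planner", "selector", "caller", "synthesizer")


def rollout(policy, task):
    """Stand-in for running the agent on one task.

    policy: dict mapping module name -> prompt text.
    task:   dict mapping module name -> keyword that module's prompt must contain.
    Returns (score, trace); the trace records, per module, whether it succeeded.
    """
    trace = []
    for module in MODULES:
        need = task.get(module)
        ok = need is None or need in policy[module]
        trace.append((module, ok, need))
    score = 1.0 if all(ok for _, ok, _ in trace) else 0.0
    return score, trace


def blame(trace):
    """Localize a failure to the first failing module and return a text critique."""
    for module, ok, need in trace:
        if not ok:
            return module, f"handle {need}"
    return None


def mutate(policy, module, critique):
    """Edit only the blamed module; every other module is copied unchanged."""
    child = dict(policy)
    child[module] = policy[module] + " " + critique
    return child


def select(population, tasks, k):
    """Keep one winner per task first (complementary candidates), then fill by total score."""
    scored = [(p, [rollout(p, t)[0] for t in tasks]) for p in population]
    keep = []
    for i in range(len(tasks)):
        winner = max(scored, key=lambda ps: ps[1][i])[0]
        if winner not in keep:
            keep.append(winner)
    for candidate, scores in sorted(scored, key=lambda ps: -sum(ps[1])):
        if len(keep) >= k:
            break
        if candidate not in keep:
            keep.append(candidate)
    return keep[:k]


def evolve(seed, tasks, generations=5, k=3):
    """Gradient-free loop: diagnose each failure, mutate the blamed module, reselect."""
    population = [seed]
    for _ in range(generations):
        children = []
        for parent in population:
            for task in tasks:
                _, trace = rollout(parent, task)
                verdict = blame(trace)
                if verdict is not None:
                    children.append(mutate(parent, *verdict))
        population = select(population + children, tasks, k)
    return max(population, key=lambda p: sum(rollout(p, t)[0] for t in tasks))
```

In this toy setting, calling `evolve` with an all-empty seed policy and a couple of keyword tasks yields a policy in which only the modules that were blamed at some point have changed. That is the property targeted mutation is meant to provide; the per-task winners kept by `select` are what allow partial solutions to different tasks to survive long enough to be combined.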

Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, Eduard Hovy • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tool Learning | RestBench TMDB | Success Rate | 86.2 | 32 |
| Function Calling | BFCL Multi-turn | Accuracy | 42.3 | 22 |
| LLM Agent Evaluation | Tau-bench retail | Pass@1 | 64.8 | 22 |
| Sequential Tool Use | RestBench Spotify | Success Rate | 86.1 | 22 |
| Stateful Agent-User Interaction | Tau-bench airline | Pass@1 | 39.1 | 22 |
| Tool-use API Generalization | ToolBench G1 v1 | Pass Rate | 83.5 | 22 |
| Tool-use API Generalization | ToolBench G2 | Pass Rate | 78.2 | 22 |
| Tool-use API Generalization | ToolBench (G3) | Pass Rate | 71.5 | 22 |
| Function Calling | BFCL Single-Turn | Accuracy | 83.9 | 22 |