APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

About

LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long-horizon decision making. But these agents cannot learn on the fly at test time. Self-evolving agents address this by accumulating memory and reflection across episodes rather than requiring model-weight updates. However, self-evolving agents often suffer from exploration collapse: as memory grows, behavior concentrates around familiar high-reward routines, reducing the chance of discovering better alternatives. To address this problem, we propose Autonomous Policy EXploration (APEX), which builds and maintains an explicit strategy space through a strategy map, a directed acyclic graph of milestones with prerequisite dependency edges. In APEX, Fork Discovery expands the map with evidence-grounded unexplored directions, while Policy Selection balances exploration and exploitation during planning. Evaluated on nine Jericho text-adventure games and WebArena, a realistic web interaction benchmark, APEX outperforms all baselines. Extensive ablations validate each component's contribution and show robustness across diverse settings, demonstrating APEX's effectiveness for sustained exploration in self-evolving agents.
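
The strategy map described above can be pictured with a small sketch. The abstract does not specify APEX's data structures or its selection rule, so everything below is an assumption made for illustration: the names `Milestone`, `StrategyMap`, `add_milestone`, `frontier`, `record`, and `select` are hypothetical, and the UCB1-style score is a generic stand-in for balancing exploration and exploitation, not the paper's Policy Selection.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Milestone:
    """One node of the map (hypothetical structure, not from the paper)."""
    name: str
    prerequisites: set = field(default_factory=set)
    visits: int = 0
    total_reward: float = 0.0
    achieved: bool = False


class StrategyMap:
    """Directed acyclic graph of milestones with prerequisite edges."""

    def __init__(self):
        self.nodes = {}

    def add_milestone(self, name, prerequisites=()):
        # A new node may only depend on nodes that already exist and is
        # added exactly once, so the graph stays acyclic by construction.
        if name in self.nodes:
            raise ValueError(f"duplicate milestone: {name}")
        missing = [p for p in prerequisites if p not in self.nodes]
        if missing:
            raise ValueError(f"unknown prerequisites: {missing}")
        self.nodes[name] = Milestone(name, set(prerequisites))

    def frontier(self):
        # Milestones not yet achieved whose prerequisites are all achieved.
        return [
            m for m in self.nodes.values()
            if not m.achieved
            and all(self.nodes[p].achieved for p in m.prerequisites)
        ]

    def record(self, name, reward, achieved=False):
        # Store the outcome of one attempt at a milestone.
        m = self.nodes[name]
        m.visits += 1
        m.total_reward += reward
        m.achieved = m.achieved or achieved

    def select(self, c=1.0):
        # Assumed UCB1-style rule: untried milestones come first, then
        # mean reward plus a bonus that shrinks as visits accumulate.
        candidates = self.frontier()
        if not candidates:
            return None
        total = sum(m.visits for m in candidates)

        def score(m):
            if m.visits == 0:
                return math.inf
            mean = m.total_reward / m.visits
            return mean + c * math.sqrt(math.log(total) / m.visits)

        return max(candidates, key=score).name
```

In this sketch, adding a new direction (the role the abstract assigns to Fork Discovery) is just another `add_milestone` call with the appropriate prerequisites. Because an untried milestone scores infinity, `select` tries it before returning to a familiar high-reward routine, which is the kind of pressure against exploration collapse the abstract describes.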

Yibo Li, Jiashuo Yang, Zhi Zheng, Zhiyuan Hu, Yuan Sui, Shizun Wang, Yufei He, Bryan Hooi • 2026

Related benchmarks

Task	Dataset	Result
Web agent task completion	WebArena (test)	Shopping Success Rate: 42.9	18
Text Adventure Game Playing	Jericho	Zork1 Score: 73	6
Text-based Game Playing	Jericho	Zork1 Score: 73	6
Web navigation and interaction	WebArena (Final-3)	Shopping Success Rate: 42.9	6
