
GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

About

Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a data-evolving paradigm: the simulator acts as a dynamic curriculum policy, continuously generating tasks tailored to the agent's "zone of proximal development". This process is guided by a simple but effective α-Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks: API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to +40.3% over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3× less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.
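The abstract does not give the exact form of the α-Curriculum Reward, but the idea of aligning generated-task difficulty with the agent's current capability can be sketched with a simple, hypothetical shaping function: the simulator is rewarded most when the agent's empirical success rate on its generated tasks sits near a target level α, and penalized when tasks are trivially easy or hopelessly hard. The function below is an illustrative assumption, not the paper's implementation.

```python
def curriculum_reward(success_rate: float, alpha: float = 0.5) -> float:
    """Hypothetical difficulty-alignment reward for a task generator.

    Peaks when the agent solves roughly `alpha` of the generated tasks
    (its "zone of proximal development") and decays linearly as tasks
    drift toward too easy (success_rate -> 1) or too hard (-> 0).
    """
    return 1.0 - abs(success_rate - alpha)

# Tasks near the agent's capability frontier score highest;
# saturated or impossible tasks score lowest.
print(curriculum_reward(0.5))  # frontier tasks -> 1.0
print(curriculum_reward(1.0))  # too easy       -> 0.5
print(curriculum_reward(0.0))  # too hard       -> 0.5
```

Any such peaked reward gives the simulator a gradient toward the agent's frontier, which is the mechanism the abstract describes: as the agent improves, previously hard tasks become easy, and the simulator is pushed to generate new, harder ones.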

Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, Mengdi Wang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Function Calling | BFCL (Berkeley Function Calling Leaderboard) | Base Score: 41.8 | 28
Tool-augmented Reasoning | API-Bank | Success Rate: 79.1 | 12
Compositional Multi-hop QA | Bamboogle | Success Rate: 76 | 12
End-to-end Planning | TravelPlanner | Success Rate (CS/HD Avg): 0.166 | 12
Embodied Instruction Following | ALFWorld official (val) | Success Rate: 54.5 | 12

Other info

GitHub
