GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
About
Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a data-evolving paradigm: the simulator acts as a dynamic curriculum policy, continuously generating tasks tailored to the agent's "zone of proximal development". This process is guided by a simple but effective α-Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks: API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to **+40.3%** over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3× less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.
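The α-Curriculum Reward can be pictured as scoring the simulator for generating tasks of the right difficulty. The exact formulation is not given here, so the sketch below is a plausible minimal version under one assumption: the reward peaks when the agent's empirical success rate on a generated task batch matches a target level α, and decays as tasks become too easy or too hard. The function name `curriculum_reward` and the linear penalty are illustrative, not the paper's definition.

```python
def curriculum_reward(success_rate: float, alpha: float = 0.5) -> float:
    """Hypothetical difficulty-alignment reward for the task generator.

    success_rate: agent's empirical pass rate on the generated tasks (0..1).
    alpha: target difficulty level; tasks at this pass rate sit in the
           agent's "zone of proximal development".

    Returns a reward in [0, 1] that is maximal when success_rate == alpha
    and decreases linearly as generated tasks drift toward trivial
    (success_rate -> 1) or impossible (success_rate -> 0).
    """
    return 1.0 - abs(success_rate - alpha)
```

Under this toy shaping, a simulator that produces tasks the agent solves about half the time (with α = 0.5) is rewarded most, while batches the agent always or never solves are penalized, which keeps the generated curriculum tracking the agent's current capability as it improves.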
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Function Calling | BFCL (Berkeley Function Calling Leaderboard) | Base Score | 41.8 | 28 |
| Tool-augmented Reasoning | API-Bank | Success Rate | 79.1 | 12 |
| Compositional multi-hop QA | Bamboogle | Success Rate | 76 | 12 |
| End-to-end Planning | TravelPlanner | Success Rate (CS/HD Avg) | 0.166 | 12 |
| Embodied Instruction Following | ALFWorld official (val) | Success Rate | 54.5 | 12 |