# Agent Learning via Early Experience

## About
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies for using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
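The data-collection step behind early experience can be pictured as a simple loop: starting from states visited by an expert, the agent proposes its own (possibly suboptimal) actions, and the environment's resulting next states are recorded as reward-free supervision. The sketch below illustrates that loop under stated assumptions: `ToyEnv`, `policy`, and `collect_early_experience` are hypothetical stand-ins for this illustration, not the paper's actual environments or API.

```python
import random

def collect_early_experience(env, policy, expert_states, n_alt=2):
    """From each expert-visited state, roll out n_alt agent-proposed actions
    and record (state, action, next_state) triples as reward-free supervision."""
    data = []
    for state in expert_states:
        for _ in range(n_alt):
            action = policy(state)                # agent's own proposed action
            next_state = env.step(state, action)  # resulting future state
            data.append({"state": state, "action": action, "next_state": next_state})
    return data

# Toy stand-ins (assumptions for illustration only):
class ToyEnv:
    def step(self, state, action):
        # next state is the action history extended by the chosen action
        return state + [action]

policy = lambda state: random.choice(["click", "scroll", "type"])

dataset = collect_early_experience(ToyEnv(), policy, expert_states=[[], ["click"]])
```

On top of such triples, implicit world modeling would train the policy to predict `next_state` from `(state, action)`, while self-reflection would prompt the model to contrast its alternative actions with the expert's and learn from the explanation.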
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Pass@1 | 87.14 | 1036 |
| Arithmetic Reasoning | MultiArith | Accuracy | 96.28 | 229 |
| Interactive Decision-making | AlfWorld | Overall Success Rate | 85.9 | 118 |
| Knowledge Reasoning | MMLU | Knowledge Reasoning Accuracy | 83.8 | 65 |
| General Reasoning | GPQA Diamond | Pass@1 Accuracy | 51.85 | 47 |
| Medical Question Answering | DDXPlus | Accuracy | 75.57 | 43 |
| Multi-hop Question Answering | HotpotQA | Avg@8 Accuracy | 85.4 | 32 |
| Interactive Reasoning | ScienceWorld Seen | Success Rate | 60.82 | 31 |
| Multiple-choice Question Answering | AQUA | Accuracy | 75.44 | 31 |
| Code Generation | DS-1000 | Pass@1 | 52.35 | 28 |