# Agent Learning via Early Experience

## About
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies for using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
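The data-collection step behind early experience can be pictured as a simple loop: starting from states visited by an expert, the agent proposes its own (possibly suboptimal) actions, and the environment's resulting next states are recorded as reward-free supervision. The sketch below illustrates that loop under stated assumptions: `ToyEnv`, `policy`, and `collect_early_experience` are hypothetical stand-ins for this illustration, not the paper's actual environments or API.

```python
import random

def collect_early_experience(env, policy, expert_states, n_alt=2):
    """From each expert-visited state, roll out n_alt agent-proposed actions
    and record (state, action, next_state) triples as reward-free supervision."""
    data = []
    for state in expert_states:
        for _ in range(n_alt):
            action = policy(state)                # agent's own proposed action
            next_state = env.step(state, action)  # resulting future state
            data.append({"state": state, "action": action, "next_state": next_state})
    return data

# Toy stand-ins (assumptions for illustration only):
class ToyEnv:
    def step(self, state, action):
        # next state is the action history extended by the chosen action
        return state + [action]

policy = lambda state: random.choice(["click", "scroll", "type"])

dataset = collect_early_experience(ToyEnv(), policy, expert_states=[[], ["click"]])
```

On top of such triples, implicit world modeling would train the policy to predict `next_state` from `(state, action)`, while self-reflection would prompt the model to contrast its alternative actions with the expert's and learn from the explanation.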
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Pass@1 | 87.14 | 1036 |
| Arithmetic Reasoning | MultiArith | Accuracy | 96.28 | 229 |
| Interactive Decision-making | AlfWorld | Overall Success Rate | 85.9 | 118 |
| Knowledge Reasoning | MMLU | Knowledge Reasoning Accuracy | 83.8 | 65 |
| General Reasoning | GPQA Diamond | Pass@1 Accuracy | 51.85 | 47 |
| Medical Question Answering | DDXPlus | Accuracy | 75.57 | 43 |
| Multi-hop Question Answering | HotpotQA | Avg@8 Accuracy | 85.4 | 32 |
| Interactive Reasoning | ScienceWorld Seen | Success Rate | 60.82 | 31 |
| Multiple-choice Question Answering | AQUA | Accuracy | 75.44 | 31 |
| Code Generation | DS-1000 | Pass@1 | 52.35 | 28 |