
AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

About

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents AgentCL, an evaluation framework for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AgentCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.
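The abstract mentions metrics for transfer gains but this page does not define them. As a rough illustration only, the sketch below assumes one plausible reading: the gain is the accuracy of a memory-equipped agent minus the accuracy of a memory-free baseline on the same stream, so a negative value signals memory-induced degradation. The names `StreamResult`, `accuracy`, and `transfer_gain`, and the toy outcomes, are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StreamResult:
    """Per-task success flags for one task stream (hypothetical container)."""
    with_memory: list[bool]      # outcomes for the agent that reuses stored experience
    without_memory: list[bool]   # outcomes for a memory-free baseline on the same tasks


def accuracy(outcomes: list[bool]) -> float:
    """Percentage of tasks solved in a stream."""
    if not outcomes:
        raise ValueError("cannot score an empty stream")
    return 100.0 * sum(outcomes) / len(outcomes)


def transfer_gain(result: StreamResult) -> float:
    """Assumed metric: accuracy with memory minus baseline accuracy, in points."""
    return accuracy(result.with_memory) - accuracy(result.without_memory)


# Toy, made-up outcomes purely to exercise the functions.
compositional = StreamResult(
    with_memory=[True, True, True, False],
    without_memory=[True, False, False, False],
)
naive = StreamResult(
    with_memory=[True, False, True, False],
    without_memory=[True, False, True, True],
)

print(transfer_gain(compositional))  # positive: memory helped on this toy stream
print(transfer_gain(naive))          # negative: memory hurt on this toy stream
```

Comparing the same memory design on a compositional stream and on a naive stream in this way is one simple way to separate plasticity (gains where reuse is possible) from interference (losses where it is not).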

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao, Huan Sun, Yu Su • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Code Generation | BigCodeBench Lite-Pro Compositional Stream | Accuracy: 66.7 | 20 |
| Code Generation | CodeEval-Pro BigCodeBench-Lite-Pro and HumanEval-Pro (1st Pass) | Average Accuracy: 66.7 | 18 |
| Code Generation | CodeEval-Pro BigCodeBench-Lite-Pro and HumanEval-Pro (2nd Pass) | Average Accuracy: 64.6 | 18 |
| Code Generation | CodeEval-Pro BigCodeBench-Lite-Pro and HumanEval-Pro (Held-out) | Average Accuracy: 71.7 | 18 |
| Complex Tasks | BrowseComp+ Complex Tasks 1st Pass | Accuracy: 90 | 16 |
| Complex Tasks | BrowseComp+ Complex Tasks 2nd Pass | Accuracy: 89 | 16 |
| Code Generation | BigCodeBench Lite-Pro Naive Stream | Accuracy: 44.5 | 16 |
| Code Generation | HumanEval-Pro (Held-out after Naive Stream) | Accuracy: 70.8 | 10 |
| Complex Task Solving | BrowseComp+ Compositional Stream | Accuracy (1st-Q): 90 | 8 |
| Code Generation | HumanEval-Pro Held-out after Compositional Stream | Accuracy: 71.7 | 8 |
Showing 10 of 18 rows
