MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

About

Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a memory trigger, which monitors the agent's reasoning state to decide explicit memory invocation, and a memory weaver, which takes the agent's current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.
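The trigger-and-weaver cycle described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the names `memory_trigger`, `memory_weaver`, `reason_with_memory`, and `toy_step`, the sigmoid threshold rule, and the tanh mapping are all hypothetical stand-ins chosen so the example runs without any model. In the paper both components operate on an LLM's reasoning state; here plain lists of floats play that role.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def memory_trigger(state: Vector, threshold: float = 0.5) -> bool:
    """Hypothetical trigger: fire when a sigmoid of the mean activation exceeds the threshold."""
    score = 1.0 / (1.0 + math.exp(-sum(state) / len(state)))
    return score > threshold


def memory_weaver(state: Vector, num_latents: int = 2) -> List[Vector]:
    """Hypothetical weaver: map the current state to a short sequence of latent vectors."""
    return [[math.tanh(x * (k + 1)) for x in state] for k in range(num_latents)]


def reason_with_memory(
    step_fn: Callable[[Vector, List[Vector]], Vector],
    init_state: Vector,
    num_steps: int,
) -> Tuple[List[Vector], int]:
    """Interleave reasoning steps with latent memory whenever the trigger fires."""
    context: List[Vector] = []
    state = init_state
    invocations = 0
    for _ in range(num_steps):
        if memory_trigger(state):
            # Weave latent memory into the context before the next reasoning step.
            context.extend(memory_weaver(state))
            invocations += 1
        state = step_fn(state, context)
        context.append(state)
    return context, invocations


def toy_step(state: Vector, context: List[Vector]) -> Vector:
    """Stand-in for one reasoning step; it flips the sign so the trigger alternates."""
    return [-0.9 * x for x in state]


if __name__ == "__main__":
    context, invocations = reason_with_memory(toy_step, [1.0, 0.5], num_steps=4)
    print(invocations, len(context))
```

The point of the sketch is the control flow: memory is not retrieved once up front but generated on demand at any step where the trigger judges the current state to need it, and the generated latents sit in the same context the next reasoning step reads.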

Guibin Zhang, Muxin Fu, Shuicheng Yan • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy: 71.12 | 1398 |
| Mathematical Reasoning | GSM8K (test) | Accuracy: 85.42 | 954 |
| Instruction Following | IFEval | IFEval Accuracy: 39.37 | 836 |
| Mathematical Reasoning | MATH | Accuracy: 50.95 | 535 |
| Mathematical Reasoning | MATH (test) | Overall Accuracy: 60.23 | 433 |
| Question Answering | NarrativeQA (test) | ROUGE-L: 63.94 | 88 |
| Long-context Reasoning | Locomo | -- | 75 |
| Embodied AI Task Planning | EB-ALFRED | Average Score: 14.33 | 72 |
| Query Answering | PersonaMem 32K context length | Query-Answering Accuracy: 70 | 60 |
| Query Answering | PersonaMem 128K context length | Query-Answering Accuracy: 0.66 | 60 |
Showing 10 of 37 rows
