# MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

## About
Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a *memory trigger*, which monitors the agent's reasoning state to decide explicit memory invocation, and a *memory weaver*, which takes the agent's current state as a stimulus to construct a latent token sequence as machine-native memory that enriches its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.
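The trigger/weaver control loop described above can be sketched in a few lines. This is a minimal illustrative mock-up, not the paper's implementation: the class names, the uncertainty threshold, and the string stand-ins for latent tokens are all assumptions — in MemGen both modules are learned components operating on the LLM's hidden states rather than on plain strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryTrigger:
    """Decides at each reasoning step whether to invoke memory.

    A real trigger is a learned classifier over the reasoning state;
    here we use a hypothetical scalar-uncertainty threshold instead.
    """
    threshold: float = 0.5

    def should_invoke(self, uncertainty: float) -> bool:
        return uncertainty > self.threshold


@dataclass
class MemoryWeaver:
    """Maps the current reasoning state to a latent token sequence.

    Stand-in for a learned generator: it derives placeholder "latent"
    tokens deterministically from the state string.
    """
    latent_len: int = 4  # hypothetical number of latent memory tokens

    def weave(self, state: str) -> List[str]:
        return [f"<mem:{abs(hash(state + str(i))) % 97}>"
                for i in range(self.latent_len)]


def reasoning_loop(steps: List[str], uncertainties: List[float]) -> List[str]:
    """Interleave reasoning steps with on-demand latent memory recall."""
    trigger, weaver = MemoryTrigger(), MemoryWeaver()
    trace: List[str] = []
    for step, u in zip(steps, uncertainties):
        if trigger.should_invoke(u):
            # Enrich the context with generated latent memory before the step.
            trace.extend(weaver.weave(step))
        trace.append(step)
    return trace


trace = reasoning_loop(["plan", "solve", "verify"], [0.9, 0.2, 0.7])
print(trace)
```

In this toy run, memory is woven in before the high-uncertainty "plan" and "verify" steps but skipped for the confident "solve" step, mirroring the paper's idea that memory invocation is decided per reasoning state rather than applied uniformly.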
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 71.12 | 983 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 85.42 | 797 |
| Mathematical Reasoning | MATH | Accuracy | 50.95 | 535 |
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 60.23 | 433 |
| Question Answering | NarrativeQA (test) | ROUGE-L | 63.94 | 61 |
| Science Reasoning | GPQA (test) | Accuracy | 21.68 | 41 |
| Code Generation | KodCode | Accuracy | 57.7 | 38 |
| Long Document Summarization | BookSum (test) | ROUGE-1 | 12.86 | 37 |
| Question Answering | WikiHop (test) | Accuracy | 41.35 | 32 |
| Question Answering | Merged QA (HotpotQA, NarrativeQA, WikiHop) (test) | Accuracy | 54.56 | 24 |