MemGen: Weaving Generative Latent Memory for Self-Evolving Agents
About
Agent memory shapes how Large Language Model (LLM)-powered agents, akin to the human brain, progressively refine themselves through environment interactions. Existing paradigms remain constrained: parametric memory forcibly adjusts model parameters, and retrieval-based memory externalizes experience into structured databases, yet neither captures the fluid interweaving of reasoning and memory that underlies human cognition. To address this gap, we propose MemGen, a dynamic generative memory framework that equips agents with a human-esque cognitive faculty. It consists of a memory trigger, which monitors the agent's reasoning state to decide explicit memory invocation, and a memory weaver, which takes the agent's current state as stimulus to construct a latent token sequence as machine-native memory to enrich its reasoning. In this way, MemGen enables agents to recall and augment latent memory throughout reasoning, producing a tightly interwoven cycle of memory and cognition. Extensive experiments across eight benchmarks show that MemGen surpasses leading external memory systems such as ExpeL and AWM by up to 38.22%, exceeds GRPO by up to 13.44%, and exhibits strong cross-domain generalization ability. More importantly, we find that without explicit supervision, MemGen spontaneously evolves distinct human-like memory faculties, including planning memory, procedural memory, and working memory, suggesting an emergent trajectory toward more naturalistic forms of machine cognition.
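The trigger-and-weaver cycle described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the logistic trigger score, the `tanh`-based weaver, and the plain-list "latent tokens" are all assumptions made for illustration. In the actual system the trigger and weaver would presumably be learned components operating on the model's hidden states.

```python
import math
from typing import List, Tuple

Vector = List[float]


def memory_trigger(state: Vector, weights: Vector, threshold: float = 0.5) -> bool:
    """Hypothetical trigger: a logistic score over the current reasoning state.

    Returns True when memory should be invoked at this step.
    """
    score = sum(s * w for s, w in zip(state, weights))
    prob = 1.0 / (1.0 + math.exp(-score))
    return prob > threshold


def memory_weaver(state: Vector, num_latents: int = 2) -> List[Vector]:
    """Hypothetical weaver: maps the current state to a short latent sequence.

    A fixed nonlinearity stands in for a learned generator here.
    """
    return [[math.tanh(x * (k + 1)) for x in state] for k in range(num_latents)]


def reasoning_loop(states: List[Vector], weights: Vector) -> Tuple[List[Vector], int]:
    """Interleave reasoning states with latent memory whenever the trigger fires."""
    context: List[Vector] = []
    invocations = 0
    for state in states:
        if memory_trigger(state, weights):
            # Weave latent memory into the context before the current step.
            context.extend(memory_weaver(state))
            invocations += 1
        context.append(state)
    return context, invocations
```

Calling `reasoning_loop([[1.0, 0.0], [-1.0, 0.0]], [2.0, 0.0])` fires the trigger only on the first state, so two latent vectors are woven in before it and the second state is appended unchanged. The point of the sketch is the control flow: memory is generated on demand from the current state and inserted into the reasoning stream, rather than retrieved from an external store or baked into the parameters.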
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 71.12 | 1398 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 85.42 | 954 |
| Instruction Following | IFEval | IFEval Accuracy | 39.37 | 836 |
| Mathematical Reasoning | MATH | Accuracy | 50.95 | 535 |
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 60.23 | 433 |
| Question Answering | NarrativeQA (test) | ROUGE-L | 63.94 | 88 |
| Long-context Reasoning | Locomo | -- | -- | 75 |
| Embodied AI Task Planning | EB-ALFRED | Average Score | 14.33 | 72 |
| Query Answering | PersonaMem 32K context length | Query-Answering Accuracy | 70 | 60 |
| Query Answering | PersonaMem 128K context length | Query-Answering Accuracy | 0.66 | 60 |