Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
About
Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking a learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that pre-selects and reasons over relevant entries. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management with minimal supervision. With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).
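The four structured operations named in the abstract (ADD, UPDATE, DELETE, NOOP) can be pictured with a minimal sketch of an external memory bank. This is an illustration only, not the authors' implementation: the `MemoryBank` class, the `Op` enum, the integer entry ids, and the example memory strings are all hypothetical, and in Memory-R1 the choice of operation is made by an RL-trained Memory Manager rather than hard-coded as it is here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Op(Enum):
    """The four memory operations listed in the abstract."""
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


@dataclass
class MemoryBank:
    """Hypothetical in-memory store keyed by integer entry ids."""
    entries: Dict[int, str] = field(default_factory=dict)
    next_id: int = 0

    def apply(self, op: Op, text: Optional[str] = None,
              entry_id: Optional[int] = None) -> None:
        if op is Op.ADD:
            if text is None:
                raise ValueError("ADD requires text")
            self.entries[self.next_id] = text
            self.next_id += 1
        elif op is Op.UPDATE:
            if text is None or entry_id not in self.entries:
                raise KeyError("UPDATE requires text and an existing entry id")
            self.entries[entry_id] = text
        elif op is Op.DELETE:
            # Deleting a missing id is treated as a no-op in this sketch.
            self.entries.pop(entry_id, None)
        # Op.NOOP: leave the bank unchanged.


# In the paper's setting a learned policy would pick these operations;
# here they are written by hand purely to show the effect of each one.
bank = MemoryBank()
bank.apply(Op.ADD, text="User adopted a dog named Buddy.")
bank.apply(Op.UPDATE, entry_id=0, text="User adopted two dogs, Buddy and Scout.")
bank.apply(Op.NOOP)
bank.apply(Op.ADD, text="User lives in Boston.")
bank.apply(Op.DELETE, entry_id=1)
```

After these calls the bank holds a single consolidated entry about the two dogs, which is the kind of merged state an UPDATE is meant to produce instead of a second, conflicting ADD.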
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | LoCoMo | F1 | 33.64 | 125 |
| Open-domain Question Answering | LoCoMo | F1 | 0.2355 | 111 |
| Single-hop Question Answering | LoCoMo | F1 | 0.4686 | 111 |
| Long-context Memory Evaluation | LongMemEval | -- | -- | 103 |
| Multi-hop Reasoning | LoCoMo | F1 Score | 36.55 | 68 |
| Query Answering | PersonaMem 32K context length | Query-Answering Accuracy | 58 | 60 |
| Query Answering | PersonaMem 128K context length | Query-Answering Accuracy | 0.61 | 60 |
| Open Domain | LoCoMo | F1 Score | 29.34 | 51 |
| Temporal | LoCoMo | F1 Score | 0.4126 | 47 |
| Single-Hop | LoCoMo | F1 Score | 37.02 | 47 |
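
The table does not define how its F1 values are computed. As a rough guide, the sketch below shows the token-overlap F1 commonly used for question answering, under the assumption that a similar metric is meant here; the `token_f1` function and its whitespace tokenization are illustrative choices, not taken from the source. The function returns a fraction in [0, 1], whereas the table mixes values that look like fractions (e.g. 0.2355) with values that look like percentages (e.g. 33.64), so rows should be compared with care.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokenization, no further normalization.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the cat sat", "the cat")
```

Multiplying the returned fraction by 100 gives the percentage form.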