Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

About

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (encompassing how agents memorize, update, and retrieve long-term information), is under-evaluated due to the lack of benchmarks. We refer to agents with memory mechanisms as memory agents. In this paper, based on classic theories from memory science and cognitive science, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and selective forgetting. Existing benchmarks either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Moreover, no existing benchmark covers all four competencies. We introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark transforms existing long-context datasets and incorporates newly constructed datasets into a multi-turn format, effectively simulating the incremental information processing characteristic of memory agents. By carefully selecting and curating datasets, our benchmark provides comprehensive coverage of the four core memory competencies outlined above, thereby offering a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
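The multi-turn format described in the abstract can be pictured with a small sketch: a long document is split into chunks, the chunks are handed to the agent one turn at a time, and questions are asked only after the agent has accumulated everything. This is an illustration under stated assumptions, not the benchmark's actual code. The names `split_into_chunks`, `ContextMemoryAgent`, and `run_episode`, the word-based chunking, and the word-overlap retrieval are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


def split_into_chunks(text: str, chunk_words: int) -> list[str]:
    """Split a long document into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


@dataclass
class ContextMemoryAgent:
    """Hypothetical baseline agent: stores every chunk it is shown, in order."""

    memory: list[str] = field(default_factory=list)

    def memorize(self, chunk: str) -> None:
        # One call corresponds to one turn of incremental input.
        self.memory.append(chunk)

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        # Toy retrieval (an assumption): rank stored chunks by word overlap
        # with the query. A real agent would use an LLM or a retriever here.
        query_words = set(query.lower().split())
        ranked = sorted(
            self.memory,
            key=lambda chunk: len(query_words & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


def run_episode(agent: ContextMemoryAgent, document: str, chunk_words: int) -> int:
    """Feed the document to the agent chunk by chunk; return the number of turns."""
    chunks = split_into_chunks(document, chunk_words)
    for chunk in chunks:
        agent.memorize(chunk)
    return len(chunks)


if __name__ == "__main__":
    agent = ContextMemoryAgent()
    turns = run_episode(agent, "alpha beta gamma delta epsilon zeta eta theta", 3)
    print(turns)
    print(agent.retrieve("where is epsilon"))
```

The point of the sketch is the ordering: the agent never sees the full document at once, so whatever it answers later depends on what its memory kept from earlier turns.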

Yuanzhe Hu, Yu Wang, Julian McAuley • 2025

Related benchmarks

Task	Dataset	Metric	Result	Rank
Fact Consolidation	MAB FC-SH (262K context) v3 (full)	Accuracy (SubEM)	48	9
Fact Consolidation (Single-Hop)	MemoryAgentBench (MAB) FC-SH 262K	Accuracy	61	8
Single-Hop Fact-based Reasoning	MAB FC-SH 262K v3 (test)	Accuracy	61	8
Fact Consolidation	MAB FC-MH 262K context v3 (full)	Accuracy (SubEM)	3	7
Fact Consolidation (Multi-Hop)	MemoryAgentBench (MAB) FC-MH 262K	Accuracy	5	5
Multi-Hop Fact-based Reasoning	MAB FC-MH, 262K v3 (test)	Accuracy	5	5
Fact Consolidation (Single-Hop)	MemoryAgentBench (MAB) FC-SH 6K	Accuracy	63	4
Fact Consolidation (Single-Hop)	MemoryAgentBench (MAB) FC-SH 32K	Accuracy	70	4
Fact Consolidation (Single-Hop)	MemoryAgentBench (MAB) FC-SH 64K	Accuracy	75	4
Fact Consolidation (Single-Hop)	MemoryAgentBench FC-SH Average	Accuracy	0.672	4

Showing 10 of 11 rows
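Several rows report "Accuracy (SubEM)". Assuming SubEM stands for substring exact match, a prediction counts as correct when any normalized gold answer appears inside the normalized prediction. The sketch below illustrates that reading. The SQuAD-style normalization (lowercasing, stripping punctuation and articles) and the helper names `normalize`, `sub_em`, and `sub_em_accuracy` are assumptions, since the page does not state how the benchmark computes the metric.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sub_em(prediction: str, gold_answers: list[str]) -> bool:
    """True if any non-empty normalized gold answer is a substring of the prediction."""
    pred = normalize(prediction)
    normalized_golds = [normalize(gold) for gold in gold_answers]
    return any(gold and gold in pred for gold in normalized_golds)


def sub_em_accuracy(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that contain at least one gold answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(sub_em(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    preds = ["The capital is Paris.", "It is London"]
    golds = [["Paris"], ["Paris"]]
    print(sub_em_accuracy(preds, golds))
```

Under this reading, SubEM is more lenient than strict exact match, because a verbose answer still scores as long as it contains the gold string.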
