MemR$^3$: Memory Retrieval via Reflective Reasoning for LLM Agents
About
Memory systems let Large Language Model (LLM) agents leverage past experiences. However, many deployed memory systems primarily optimize compression and storage, with comparatively less emphasis on explicit, closed-loop control of memory retrieval. Motivated by this observation, we frame memory retrieval as an autonomous, accurate, and compatible agent system, named MemR$^3$, built on two core mechanisms: 1) a router that selects among retrieve, reflect, and answer actions to optimize answer quality; and 2) a global evidence-gap tracker that makes the answering process transparent and tracks evidence collection. This design departs from the standard retrieve-then-answer pipeline by introducing a closed-loop control mechanism that enables autonomous decision-making. Empirical results on the LoCoMo benchmark show that MemR$^3$ surpasses strong baselines on LLM-as-a-Judge score. In particular, it improves existing retrievers across four categories, with overall gains of +7.29% on RAG and +1.94% on Zep with a GPT-4.1-mini backend, offering a plug-and-play controller for existing memory stores.
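To make the closed-loop idea concrete, here is a minimal sketch of a retrieve/reflect/answer loop with an evidence-gap tracker. It is not the MemR$^3$ implementation: the names `EvidenceGapTracker`, `route`, `run`, and `keyword_retrieve` are hypothetical, the router is a hand-written rule standing in for an LLM decision, and a gap is treated as closed when a chosen keyword appears in the collected evidence.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

STOPWORDS = {"a", "the", "to", "in", "is", "does", "s"}

def words(text: str) -> set:
    """Lowercased content words of a text (toy tokenizer)."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

@dataclass
class EvidenceGapTracker:
    # gap label -> keyword whose presence in the evidence closes the gap
    # (assumption: a real system would let the LLM judge this)
    gaps: Dict[str, str]
    evidence: List[str] = field(default_factory=list)

    def add(self, snippets: List[str]) -> None:
        for s in snippets:
            if s not in self.evidence:
                self.evidence.append(s)
        seen = set().union(*(words(e) for e in self.evidence))
        self.gaps = {g: k for g, k in self.gaps.items() if k not in seen}

def route(tracker: EvidenceGapTracker, last_action, steps_left: int) -> str:
    """Rule-based stand-in for the router: answer once no gaps remain or the
    budget is spent; otherwise alternate retrieve and reflect."""
    if not tracker.gaps or steps_left <= 0:
        return "answer"
    return "reflect" if last_action == "retrieve" else "retrieve"

def run(question: str, gaps: Dict[str, str],
        retrieve: Callable[[str], List[str]],
        max_steps: int = 6) -> Tuple[EvidenceGapTracker, List[str]]:
    tracker = EvidenceGapTracker(gaps=dict(gaps))
    query, last, trace = question, None, []
    for step in range(max_steps):
        action = route(tracker, last, max_steps - step - 1)
        trace.append(action)
        if action == "answer":
            break
        if action == "retrieve":
            tracker.add(retrieve(query))
        else:
            # reflect: rebuild the query from evidence so far plus open gaps
            query = " ".join(tracker.evidence + list(tracker.gaps))
        last = action
    return tracker, trace

# Toy memory store and retriever (any existing retriever could be plugged in).
MEMORIES = [
    "Alice adopted a dog named Rex.",
    "Rex goes to a clinic in Boston.",
]

def keyword_retrieve(query: str) -> List[str]:
    q = words(query)
    return [m for m in MEMORIES if q & words(m)]

tracker, trace = run(
    "Where does Alice's dog get treated?",
    {"dog's name": "named", "clinic city": "clinic"},
    keyword_retrieve,
)
print(trace)  # ['retrieve', 'reflect', 'retrieve', 'answer']
```

In this toy run the first retrieval finds only the adoption memory, the reflect step folds the newly learned name into the query, and the second retrieval closes the remaining gap before the loop answers. The retriever is passed in as a plain callable, which mirrors the plug-and-play framing: the controller does not depend on how the memory store is implemented.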
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long-term memory evaluation | Locomo | Overall F1 | 20.49 | 128 |
| Question Answering | LoCoMo (test) | Single-hop Score | 86.43 | 24 |
| Long-horizon memory-based reasoning | Locomo | Multi-hop R-1 Score | 17.34 | 10 |
| Long-term Memory Retrieval | LongMemEval | Knowledge Update | 78.2 | 10 |
| Factual Accuracy and Reasoning | Locomo | Single-hop Accuracy | 88.53 | 9 |
| Proactive memory triggering | ProactiveMemBench | Recall@5 (Behavioral) | 53.8 | 8 |
| Long-term Dialogue Memory Management | Locomo | F1 | 35.3 | 7 |
| Long-term dialogue memory evaluation | GVD | Accuracy | 93 | 6 |