Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
About
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves an overall score around 2% higher than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method: Mem0 attains 91% lower p95 latency and cuts token cost by more than 90%, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.
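The extract-consolidate-retrieve loop described above can be sketched in a few lines. This is a toy illustration, not Mem0's actual API: the `MemoryStore` class and its methods are hypothetical names, the "extraction" step here just splits sentences where a real pipeline would prompt an LLM for salient facts, and retrieval uses simple word overlap in place of dense-vector similarity search.

```python
import re
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class MemoryStore:
    # Hypothetical store: fact text -> how many times it has been observed.
    facts: dict[str, int] = field(default_factory=dict)

    def extract(self, utterance: str) -> list[str]:
        # Toy extraction: treat each sentence as a candidate fact.
        # A real system would use an LLM to pull out salient statements.
        return [s.strip() for s in utterance.split(".") if s.strip()]

    def consolidate(self, candidates: list[str]) -> None:
        # Merge new facts into memory; a repeated fact reinforces the
        # existing entry instead of creating a duplicate (the ADD vs.
        # UPDATE decision a production consolidation step would make).
        for fact in candidates:
            self.facts[fact] = self.facts.get(fact, 0) + 1

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Toy retrieval: rank stored facts by word overlap with the query.
        # A production system would use embedding similarity instead.
        q = _tokens(query)
        ranked = sorted(self.facts, key=lambda f: len(q & _tokens(f)), reverse=True)
        return ranked[:k]


store = MemoryStore()
store.consolidate(store.extract("Alice lives in Paris. Alice likes tea."))
store.consolidate(store.extract("Alice lives in Paris."))  # reinforced, not duplicated
print(store.retrieve("Where does Alice live?"))
```

Only the salient facts (not the raw dialogue) are stored, which is what keeps the prompt small at answer time relative to feeding the model the full conversation history. The graph-based variant would additionally link facts through shared entities (e.g. an `Alice -> lives_in -> Paris` edge).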
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | F1 Score | 30.13 | 221 |
| Question Answering | MuSiQue | EM | 23.33 | 84 |
| Long-term memory evaluation | LOCOMO | Overall F1 | 45.09 | 70 |
| Multi-hop Question Answering | LOCOMO | F1 | 42.57 | 67 |
| Long-context Question Answering | LOCOMO | Average F1 | 45.09 | 64 |
| Question Answering | NarrativeQA (test) | ROUGE-L | 5.23 | 61 |
| Long-context Memory Retrieval | LOCOMO | Single-hop | 73.33 | 55 |
| Open-domain Question Answering | LOCOMO | F1 | 0.2864 | 53 |
| Single-hop Question Answering | LOCOMO | F1 | 0.4849 | 53 |
| Interactive Decision-making | AlfWorld | PICK | 54 | 52 |