
Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

About

LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA.
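The deterministic aggregation step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Candidate` type, the `parse_numbered_facts` helper, and the assumed `"<serial>. <fact text>"` line format are all hypothetical, and the step that selects which facts are relevant to a question (the candidate-extraction stage) is assumed to have run upstream. The sketch only shows the final resolver: among conflicting candidates, return the one with the largest serial rather than asking the LLM to judge freshness.

```python
from __future__ import annotations

import re
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    """Hypothetical container for one candidate fact and its serial number."""
    serial: int
    text: str


# Assumed input format (not confirmed by the source): "<serial>. <fact text>".
FACT_LINE = re.compile(r"^\s*(\d+)\.\s+(.*\S)\s*$")


def parse_numbered_facts(chunks: list[str]) -> list[Candidate]:
    """Collect every line that looks like a numbered fact from retrieved chunks."""
    candidates = []
    for chunk in chunks:
        for line in chunk.splitlines():
            match = FACT_LINE.match(line)
            if match:
                candidates.append(Candidate(int(match.group(1)), match.group(2)))
    return candidates


def resolve_latest(candidates: list[Candidate]) -> Optional[Candidate]:
    """Deterministic conflict resolution: the candidate with the highest serial wins."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.serial)


if __name__ == "__main__":
    retrieved = [
        "12. The capital of X is A.\n57. The capital of X is B.",
        "unnumbered text is ignored",
    ]
    winner = resolve_latest(parse_numbered_facts(retrieved))
    print(winner.text if winner else "no candidate found")
```

Swapping the `key` function so it reads a timestamp instead of a serial gives the max(timestamp) variant that the abstract mentions for LongMemEval; the rest of the resolver would stay the same under these assumptions.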

Vikas Reddy, Sumanth Challaram • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Fact Consolidation | MAB FC-SH (262K context) v3 (full) | Accuracy (SubEM) | 93 | 9 |
| Fact Consolidation (Single-Hop) | MemoryAgentBench (MAB) FC-SH 262K | Accuracy | 93 | 8 |
| Single-Hop Fact-based Reasoning | MAB FC-SH 262K v3 (test) | Accuracy | 93 | 8 |
| Fact Consolidation | MAB FC-MH 262K context v3 (full) | Accuracy (SubEM) | 27 | 7 |
| Fact Consolidation (Multi-Hop) | MemoryAgentBench (MAB) FC-MH 262K | Accuracy | 27 | 5 |
| Multi-Hop Fact-based Reasoning | MAB FC-MH, 262K v3 (test) | Accuracy | 27 | 5 |
| Fact Consolidation (Single-Hop) | MemoryAgentBench (MAB) FC-SH 6K | Accuracy | 99 | 4 |
| Fact Consolidation (Single-Hop) | MemoryAgentBench (MAB) FC-SH 32K | Accuracy | 92 | 4 |
| Fact Consolidation (Single-Hop) | MemoryAgentBench (MAB) FC-SH 64K | Accuracy | 95 | 4 |
| Fact Consolidation (Single-Hop) | MemoryAgentBench FC-SH Average | Accuracy | 0.948 | 4 |
Showing 10 of 15 rows
