MemDLM: Memory-Enhanced DLM Training

About

Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, standard DLM training uses a static, single-step masked prediction objective that never exposes the model to the progressive denoising dynamics of inference, and forces all contextual information to be maintained purely through token-space attention, which becomes increasingly diluted as context length grows. We propose MemDLM (Memory-Enhanced DLM), which introduces a second memory channel by embedding a simulated denoising trajectory into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience, while an outer loop updates the base model conditioned on this memory. By offloading part of the memorization burden from token-space attention to parameter space, MemDLM yields faster convergence, stronger long-context representations, and lower training loss, even when the fast weights are discarded at inference time. Re-enabling the inner loop at inference provides an additional prompt-specific adaptation effect, where the Parametric Memory acts as an emergent in-weight retrieval mechanism on challenging Needle-in-a-Haystack tasks. Code: https://github.com/JarvisPei/MemDLM.
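The inner/outer loop structure described above can be illustrated with a toy sketch. This is not the MemDLM implementation: a small linear regressor stands in for the DLM, zeroing input features stands in for token masking, and the outer update is a first-order approximation that holds the fast weights fixed. All of these are assumptions made for illustration, and every name (`corrupt`, `bilevel_step`, `train`) is hypothetical.

```python
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, batch):
    return sum((predict(w, x) - y) ** 2 for x, y in batch) / len(batch)

def mse_grad(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            g[i] += 2.0 * err * xi / len(batch)
    return g

def corrupt(batch, keep_prob, rng):
    """Toy stand-in for masking: zero each input feature with prob 1 - keep_prob."""
    return [([xi if rng.random() < keep_prob else 0.0 for xi in x], y)
            for x, y in batch]

def bilevel_step(base, batch, rng, inner_steps=3, inner_lr=0.05, outer_lr=0.05):
    # Fast weights (the "parametric memory" in this toy) start from zero
    # for every batch.
    fast = [0.0] * len(base)

    # Inner loop: simulate a denoising trajectory by moving from heavily
    # corrupted to lightly corrupted inputs, updating only the fast weights.
    for k in range(inner_steps):
        keep_prob = (k + 1) / (inner_steps + 1)
        noisy = corrupt(batch, keep_prob, rng)
        effective = [b + f for b, f in zip(base, fast)]
        g = mse_grad(effective, noisy)
        fast = [f - inner_lr * gi for f, gi in zip(fast, g)]

    # Outer loop: update the base weights conditioned on the fast weights.
    # Assumption: first-order update, fast weights treated as constants.
    effective = [b + f for b, f in zip(base, fast)]
    g = mse_grad(effective, batch)
    new_base = [b - outer_lr * gi for b, gi in zip(base, g)]
    return new_base, fast

def train(steps=300, seed=0):
    rng = random.Random(seed)
    true_w = [1.5, -2.0, 0.5]
    data = []
    for _ in range(64):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        data.append((x, predict(true_w, x)))

    base = [0.0] * len(true_w)
    initial_loss = mse(base, data)
    for _ in range(steps):
        batch = rng.sample(data, 16)
        base, _fast = bilevel_step(base, batch, rng)  # fast weights discarded
    # Evaluate the base weights alone, i.e. without any fast weights.
    return initial_loss, mse(base, data)

if __name__ == "__main__":
    initial_loss, final_loss = train()
    print(f"base-only loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

In this sketch the fast weights are thrown away after every step and the final loss is measured on the base weights alone, mirroring the setting where the memory is discarded at inference time. Re-running the inner loop on a new input before predicting would correspond to the prompt-specific adaptation mode.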

Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Needle-in-a-Haystack (NIAH) retrieval | RULER 4k | MV Accuracy | 100 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | RULER 8k | MV Success Rate | 99.9 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 2K | Accuracy | 65.2 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 4K | Accuracy | 61.2 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 8K | Accuracy | 57 | 6 |
| Long-context Language Understanding | LongBench | TriviaQA Score | 87.77 | 3 |
| Needle-In-A-Haystack Retrieval | RULER 16k context length | RULER-MV Score | 29.4 | 3 |
| Needle-In-A-Haystack Retrieval | RULER 32k context length | RULER-MV Retrieval Score | 15.35 | 3 |
| Needle-In-A-Haystack Retrieval | Babilong 16k context length | Needle-in-a-Haystack Accuracy (16K) | 22.2 | 3 |
| Needle-In-A-Haystack Retrieval | BABILong 32K context length | Accuracy | 9 | 3 |