# MemDLM: Memory-Enhanced DLM Training

## About
Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, standard DLM training uses a static, single-step masked-prediction objective: the model is never exposed to the progressive denoising dynamics of inference, and all contextual information must be maintained purely through token-space attention, which becomes increasingly diluted as context length grows. We propose MemDLM (Memory-Enhanced DLM), which introduces a second memory channel by embedding a simulated denoising trajectory into training via bi-level optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience, while an outer loop updates the base model conditioned on this memory. By offloading part of the memorization burden from token-space attention to parameter space, MemDLM yields faster convergence, stronger long-context representations, and lower training loss, even when the fast weights are discarded at inference time. Re-enabling the inner loop at inference provides an additional prompt-specific adaptation effect, in which the Parametric Memory acts as an emergent in-weight retrieval mechanism on challenging Needle-in-a-Haystack tasks. Code: https://github.com/JarvisPei/MemDLM.
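The inner/outer loop structure described above can be sketched in a toy PyTorch form. This is a minimal illustration under stated assumptions, not the paper's actual implementation: it assumes a linear "base model", a single additive fast-weight matrix as the Parametric Memory, random per-position masking as the simulated denoising trajectory, and mean-squared error as the masked-prediction loss. All names and hyperparameters are illustrative.

```python
# Toy sketch of MemDLM-style bi-level training (illustrative only; the real
# model, masking schedule, loss, and hyperparameters are not specified here).
import torch

torch.manual_seed(0)
D = 16                                       # toy hidden size (assumption)
base = torch.nn.Linear(D, D)                 # slow weights, updated by the outer loop
opt = torch.optim.SGD(base.parameters(), lr=1e-2)

def masked_denoise_loss(W_fast, x, mask):
    """Reconstruct masked positions from a corrupted input; the fast weights
    W_fast act as an additive parametric memory on top of the base model."""
    x_in = x * (~mask)                       # zero out "masked" entries
    pred = base(x_in) + x_in @ W_fast
    return ((pred - x)[mask] ** 2).mean()

for step in range(50):
    x = torch.randn(8, D)                    # toy batch of token embeddings
    # Inner loop: fit the fast weights over a short simulated denoising
    # trajectory (a few corruption levels of the same batch).
    W_fast = torch.zeros(D, D, requires_grad=True)
    for t in range(3):
        mask = torch.rand(8, D) < 0.5        # fresh corruption each inner step
        inner_loss = masked_denoise_loss(W_fast, x, mask)
        (g,) = torch.autograd.grad(inner_loss, W_fast)
        W_fast = (W_fast - 0.1 * g).detach().requires_grad_(True)
    # Outer loop: update the base model conditioned on the learned memory
    # (fast weights are held fixed here, mirroring their optional discard
    # at inference time).
    mask = torch.rand(8, D) < 0.5
    outer_loss = masked_denoise_loss(W_fast.detach(), x, mask)
    opt.zero_grad()
    outer_loss.backward()
    opt.step()
```

Re-enabling the inner loop at inference would correspond to running the fast-weight updates on the prompt itself before decoding, giving the prompt-specific adaptation effect mentioned above.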
## Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Needle-in-a-Haystack (NIAH) retrieval | RULER 4k | MV Accuracy | 100 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | RULER 8k | MV Success Rate | 99.9 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 2K | Accuracy | 65.2 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 4K | Accuracy | 61.2 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 8K | Accuracy | 57 | 6 |
| Long-context Language Understanding | LongBench | TriviaQA Score | 87.77 | 3 |
| Needle-in-a-Haystack (NIAH) retrieval | RULER 16k | RULER-MV Score | 29.4 | 3 |
| Needle-in-a-Haystack (NIAH) retrieval | RULER 32k | RULER-MV Retrieval Score | 15.35 | 3 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 16K | Accuracy | 22.2 | 3 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 32K | Accuracy | 9 | 3 |