# MemDLM: Memory-Enhanced DLM Training

## About
Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, standard DLM training uses a static, single-step masked-prediction objective: the model is never exposed to the progressive denoising dynamics of inference, and all contextual information must be maintained purely through token-space attention, which becomes increasingly diluted as context length grows. We propose MemDLM (Memory-Enhanced DLM), which introduces a second memory channel by embedding a simulated denoising trajectory into training via bi-level optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience, while an outer loop updates the base model conditioned on this memory. By offloading part of the memorization burden from token-space attention to parameter space, MemDLM yields faster convergence, stronger long-context representations, and lower training loss, even when the fast weights are discarded at inference time. Re-enabling the inner loop at inference provides an additional prompt-specific adaptation effect, in which the Parametric Memory acts as an emergent in-weight retrieval mechanism on challenging Needle-in-a-Haystack tasks. Code: https://github.com/JarvisPei/MemDLM.
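The inner/outer loop structure described above can be sketched in a toy PyTorch form. This is a minimal illustration under stated assumptions, not the paper's actual implementation: it assumes a linear "base model", a single additive fast-weight matrix as the Parametric Memory, random per-position masking as the simulated denoising trajectory, and mean-squared error as the masked-prediction loss. All names and hyperparameters are illustrative.

```python
# Toy sketch of MemDLM-style bi-level training (illustrative only; the real
# model, masking schedule, loss, and hyperparameters are not specified here).
import torch

torch.manual_seed(0)
D = 16                                       # toy hidden size (assumption)
base = torch.nn.Linear(D, D)                 # slow weights, updated by the outer loop
opt = torch.optim.SGD(base.parameters(), lr=1e-2)

def masked_denoise_loss(W_fast, x, mask):
    """Reconstruct masked positions from a corrupted input; the fast weights
    W_fast act as an additive parametric memory on top of the base model."""
    x_in = x * (~mask)                       # zero out "masked" entries
    pred = base(x_in) + x_in @ W_fast
    return ((pred - x)[mask] ** 2).mean()

for step in range(50):
    x = torch.randn(8, D)                    # toy batch of token embeddings
    # Inner loop: fit the fast weights over a short simulated denoising
    # trajectory (a few corruption levels of the same batch).
    W_fast = torch.zeros(D, D, requires_grad=True)
    for t in range(3):
        mask = torch.rand(8, D) < 0.5        # fresh corruption each inner step
        inner_loss = masked_denoise_loss(W_fast, x, mask)
        (g,) = torch.autograd.grad(inner_loss, W_fast)
        W_fast = (W_fast - 0.1 * g).detach().requires_grad_(True)
    # Outer loop: update the base model conditioned on the learned memory
    # (fast weights are held fixed here, mirroring their optional discard
    # at inference time).
    mask = torch.rand(8, D) < 0.5
    outer_loss = masked_denoise_loss(W_fast.detach(), x, mask)
    opt.zero_grad()
    outer_loss.backward()
    opt.step()
```

Re-enabling the inner loop at inference would correspond to running the fast-weight updates on the prompt itself before decoding, giving the prompt-specific adaptation effect mentioned above.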
## Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Needle-in-a-Haystack (NIAH) retrieval | RULER 4k | MV Accuracy | 100 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | RULER 8k | MV Success Rate | 99.9 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 2K | Accuracy | 65.2 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 4K | Accuracy | 61.2 | 6 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 8K | Accuracy | 57 | 6 |
| Long-context Language Understanding | LongBench | TriviaQA Score | 87.77 | 3 |
| Needle-in-a-Haystack (NIAH) retrieval | RULER 16k | RULER-MV Score | 29.4 | 3 |
| Needle-in-a-Haystack (NIAH) retrieval | RULER 32k | RULER-MV Retrieval Score | 15.35 | 3 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 16K | Accuracy | 22.2 | 3 |
| Needle-in-a-Haystack (NIAH) retrieval | BABILong 32K | Accuracy | 9 | 3 |