
Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

About

Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bound maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines in sample quality with few denoising steps.

Tianyu Xie, Shuchen Xue, Zijin Feng, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Cheng Zhang • 2025
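
The method as summarized above pairs a masked-diffusion denoiser with a latent variable inferred by an amortized recognition model, trained by maximizing a variational lower bound. The sketch below illustrates how such an objective could be assembled; it is not the authors' implementation, and the Gaussian latent, the toy MLP networks, and all names and sizes (Recognition, Denoiser, vadd_loss, VOCAB, LATENT_DIM, etc.) are assumptions made for illustration.

```python
# Hedged sketch of a VADD-style training objective: a masked-diffusion denoiser
# conditioned on a latent z from an auxiliary recognition model, trained by
# maximizing a variational lower bound. All architectural details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 128          # token vocabulary size (assumed)
MASK_ID = VOCAB      # extra index reserved for the [MASK] token
SEQ_LEN = 32
LATENT_DIM = 16

class Recognition(nn.Module):
    """Amortized recognition model q(z | x0): clean data -> latent Gaussian parameters."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, 64)
        self.net = nn.Sequential(nn.Linear(64 * SEQ_LEN, 256), nn.SiLU(),
                                 nn.Linear(256, 2 * LATENT_DIM))
    def forward(self, x0):
        h = self.embed(x0).flatten(1)
        mu, logvar = self.net(h).chunk(2, dim=-1)
        return mu, logvar

class Denoiser(nn.Module):
    """Masked-diffusion denoiser p(x0 | xt, z): predicts token logits for every position."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, 64)
        self.z_proj = nn.Linear(LATENT_DIM, 64)
        self.net = nn.Sequential(nn.Linear(64, 256), nn.SiLU(),
                                 nn.Linear(256, VOCAB))
    def forward(self, xt, z):
        h = self.embed(xt) + self.z_proj(z).unsqueeze(1)   # broadcast z over positions
        return self.net(h)                                  # (batch, seq, vocab)

def vadd_loss(x0, recog, denoiser):
    """One Monte Carlo estimate of the negative variational lower bound."""
    # 1) sample a masking level t ~ U(0,1) and mask each position independently
    t = torch.rand(x0.size(0), 1)
    mask = torch.rand_like(x0, dtype=torch.float) < t
    xt = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    # 2) sample the latent with the reparameterization trick
    mu, logvar = recog(x0)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    # 3) reconstruction term: cross entropy on the masked positions only,
    #    reweighted by 1/t as in standard masked-diffusion objectives
    logits = denoiser(xt, z)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")
    recon = ((ce * mask.float()).sum(1, keepdim=True) / t).mean()

    # 4) KL(q(z | x0) || N(0, I)) closes the variational lower bound
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
    return recon + kl

if __name__ == "__main__":
    recog, denoiser = Recognition(), Denoiser()
    x0 = torch.randint(0, VOCAB, (8, SEQ_LEN))
    loss = vadd_loss(x0, recog, denoiser)
    loss.backward()
    print(float(loss))
```

The departure from a plain MDM objective is the latent z: it is shared across all positions when the denoiser fills in masked tokens, which is one way the correlations among dimensions described in the abstract can be captured implicitly even when many tokens are unmasked in a single step.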

Related benchmarks

Task                                 | Dataset                                        | Result                  | Rank
Language Modeling                    | arXiv (test)                                   | PPL 36.39               | 145
Language Modeling                    | LM1B (test)                                    | Perplexity 69.71        | 130
Language Modeling                    | LAMBADA (test)                                 | --                      | 71
Language Modeling                    | WikiText (test)                                | Perplexity 34.78        | 62
Language Modeling                    | PubMed (test)                                  | Perplexity 40.62        | 14
Language Modeling                    | AG News (test)                                 | Perplexity 68           | 14
Language Modeling                    | LM1B, GPT-2 small model size equivalent (test) | Perplexity 20.53        | 10
Image Generation                     | CIFAR-10 32 x 32 (test)                        | BPD 2.74                | 4
Image Generation                     | Binarized MNIST 32 x 32 (test)                 | BPD 0.063               | 2
Two-dimensional generative modeling  | checkerboard                                   | JS Divergence (1) 0.062 | 2
(Showing 10 of 12 benchmark rows.)
