Gumbel Distillation for Parallel Text Generation
About
The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality because they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation integrates seamlessly with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and a 10.5% improvement in generative perplexity over MDLM trained on the OpenWebText dataset. Code is available at https://github.com/hxixixh/gumbel-distill.
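The Gumbel-Max trick pairs each teacher-sampled token with the exact noise vector that produced it, so the teacher's stochastic decoding becomes a deterministic function of latent noise that a parallel student can be trained to reproduce. The sketch below illustrates that pairing step only, assuming a PyTorch autoregressive teacher that maps `input_ids` to per-position logits; the function and argument names (`sample_with_gumbel_max`, `teacher`, `seq_len`) are illustrative and not taken from the repository.

```python
import torch

def sample_gumbel(shape, eps=1e-10):
    """Draw standard Gumbel(0, 1) noise via -log(-log(U))."""
    u = torch.rand(shape)
    return -torch.log(-torch.log(u + eps) + eps)

@torch.no_grad()
def sample_with_gumbel_max(teacher, prompt_ids, seq_len, vocab_size):
    """Decode the AR teacher step by step, recording the Gumbel noise that
    deterministically yields each sampled token (Gumbel-Max trick)."""
    ids = prompt_ids.clone()
    noises, tokens = [], []
    for _ in range(seq_len):
        logits = teacher(ids)[:, -1, :]               # next-token logits (B, V)
        g = sample_gumbel((ids.size(0), vocab_size))  # latent noise (B, V)
        # argmax(logits + Gumbel) is an exact sample from softmax(logits),
        # but conditioned on g it is a deterministic choice.
        next_id = torch.argmax(logits + g, dim=-1)
        noises.append(g)
        tokens.append(next_id)
        ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)
    # The (noise, token) pairs define the deterministic mapping that the
    # parallel student is distilled to imitate.
    return torch.stack(noises, dim=1), torch.stack(tokens, dim=1)
```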
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Unconditional Text Generation | OpenWebText | Gen. PPL | 24.37 | 100 |
| Language Modeling | LM1B (val) | Perplexity | 22.69 | 55 |
| Language Modeling | WikiText (val) | Perplexity | 13.86 | 54 |
| Language Modeling | AG News (val) | Perplexity | 18.19 | 28 |
| Unconditional Generation | LM1B | Generation Perplexity | 46.06 | 7 |
| Likelihood Estimation | PTB (val) | Perplexity | 35.12 | 4 |
| Likelihood Estimation | LAMBADA (val) | Perplexity | 15.56 | 4 |
| Likelihood Estimation | Pubmed Scientific Papers (val) | Perplexity | 19.78 | 4 |
| Likelihood Estimation | Arxiv Scientific Papers (val) | Perplexity | 16.85 | 4 |
| Unconditional Text Generation | OpenWebText | Clarity | 3.41 | 4 |