Gumbel Distillation for Parallel Text Generation
About
The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality because they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation integrates seamlessly with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and a 10.5% improvement in generative perplexity over MDLM trained on the OpenWebText dataset. Code is available at https://github.com/hxixixh/gumbel-distill.
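The Gumbel-Max trick pairs each teacher-sampled token with the exact noise vector that produced it, so the teacher's stochastic decoding becomes a deterministic function of latent noise that a parallel student can be trained to reproduce. The sketch below illustrates that pairing step only, assuming a PyTorch autoregressive teacher that maps `input_ids` to per-position logits; the function and argument names (`sample_with_gumbel_max`, `teacher`, `seq_len`) are illustrative and not taken from the repository.

```python
import torch

def sample_gumbel(shape, eps=1e-10):
    """Draw standard Gumbel(0, 1) noise via -log(-log(U))."""
    u = torch.rand(shape)
    return -torch.log(-torch.log(u + eps) + eps)

@torch.no_grad()
def sample_with_gumbel_max(teacher, prompt_ids, seq_len, vocab_size):
    """Decode the AR teacher step by step, recording the Gumbel noise that
    deterministically yields each sampled token (Gumbel-Max trick)."""
    ids = prompt_ids.clone()
    noises, tokens = [], []
    for _ in range(seq_len):
        logits = teacher(ids)[:, -1, :]               # next-token logits (B, V)
        g = sample_gumbel((ids.size(0), vocab_size))  # latent noise (B, V)
        # argmax(logits + Gumbel) is an exact sample from softmax(logits),
        # but conditioned on g it is a deterministic choice.
        next_id = torch.argmax(logits + g, dim=-1)
        noises.append(g)
        tokens.append(next_id)
        ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)
    # The (noise, token) pairs define the deterministic mapping that the
    # parallel student is distilled to imitate.
    return torch.stack(noises, dim=1), torch.stack(tokens, dim=1)
```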
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Unconditional Text Generation | OpenWebText | Gen. PPL | 24.37 | 100 |
| Language Modeling | LM1B (val) | Perplexity | 22.69 | 55 |
| Language Modeling | WikiText (val) | Perplexity | 13.86 | 54 |
| Language Modeling | AG News (val) | Perplexity | 18.19 | 28 |
| Unconditional Generation | LM1B | Generation Perplexity | 46.06 | 7 |
| Likelihood Estimation | PTB (val) | Perplexity | 35.12 | 4 |
| Likelihood Estimation | LAMBADA (val) | Perplexity | 15.56 | 4 |
| Likelihood Estimation | Pubmed Scientific Papers (val) | Perplexity | 19.78 | 4 |
| Likelihood Estimation | Arxiv Scientific Papers (val) | Perplexity | 16.85 | 4 |
| Unconditional Text Generation | OpenWebText | Clarity | 3.41 | 4 |