Causal Autoregressive Diffusion Language Model
About
In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3$\times$ compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
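As a rough illustration of confidence-based, variable-length decoding, the sketch below accepts the longest leading run of drafted tokens whose confidence clears a threshold. It is not taken from the paper: the function name `accept_confident_prefix`, the `threshold` and `min_tokens` parameters, and the prefix-acceptance rule itself are all assumptions chosen for illustration, and the actual CARD decoding rule may differ.

```python
from typing import Sequence


def accept_confident_prefix(
    confidences: Sequence[float],
    threshold: float = 0.9,   # hypothetical cutoff, not a value from the paper
    min_tokens: int = 1,      # hypothetical floor so decoding always advances
) -> int:
    """Return how many leading draft tokens to accept in one decoding step.

    `confidences` holds one score per drafted position (for example the
    maximum softmax probability at that position). Tokens are accepted left
    to right until the first score falls below `threshold`, so the accepted
    span is always a contiguous prefix, which keeps it compatible with a
    causal KV cache.
    """
    accepted = 0
    for score in confidences:
        if score < threshold:
            break
        accepted += 1
    # Accept at least `min_tokens` (bounded by the draft length) to guarantee progress.
    return max(accepted, min(min_tokens, len(confidences)))


if __name__ == "__main__":
    # Made-up confidence scores for a four-token draft.
    draft_scores = [0.99, 0.95, 0.50, 0.99]
    print(accept_confident_prefix(draft_scores))
```

Under this assumed rule the number of tokens emitted per step varies with the model's confidence: easy spans advance several positions at once, while uncertain spans fall back to one token per step, as in ordinary autoregressive decoding.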
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 53.29 | 1896 |
| Language Modeling | PTB | Perplexity | 97.74 | 1234 |
| Commonsense Reasoning | PIQA | Accuracy | 71.71 | 757 |
| Language Modeling | WikiText | Perplexity | 38.67 | 740 |
| Language Modeling | LAMBADA | Perplexity | 30.36 | 198 |
| Language Modeling | OpenWebText | Perplexity | 17.59 | 122 |
| Coreference Resolution | WinoGrande | Accuracy | 53.28 | 61 |
| Language Modeling | Pubmed | Perplexity | 13.2 | 59 |
| Language Modeling | arXiv | Perplexity | 20.34 | 58 |
| Language Modeling | LM1B | Perplexity | 29.61 | 39 |