Causal Autoregressive Diffusion Language Model
About
In this work, we propose Causal Autoregressive Diffusion (CARD), a framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process under a strictly causal attention mask, enabling dense per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking scheme that preserves local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, in which the model leverages KV caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3$\times$ compared to block diffusion methods. Our results show that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for the next generation of efficient LLMs.
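The confidence-based dynamic parallel decoding described above can be sketched as follows. This is a hypothetical illustration, not the paper's exact procedure: the function name, the block size, and the top-1-probability acceptance rule with a fixed threshold are all assumptions. The idea is that one forward pass scores a block of candidate positions, and the longest prefix whose per-token confidence clears the threshold is committed; in a real implementation the KV cache would then be advanced by that many positions.

```python
# Hedged sketch of confidence-based parallel decoding (assumed acceptance rule:
# top-1 probability above a fixed threshold; not taken from the paper).
import numpy as np

def decode_step(probs: np.ndarray, threshold: float = 0.9) -> int:
    """Return how many parallel-proposed tokens to accept.

    probs: (block_size, vocab_size) next-token distributions for each
    candidate position, all produced in a single forward pass.
    """
    confidences = probs.max(axis=-1)   # top-1 probability per position
    accepted = 0
    for c in confidences:              # accept the longest confident prefix
        if c < threshold:
            break
        accepted += 1
    return max(accepted, 1)            # always commit at least one token

# Toy example: 4 candidate positions over a 5-token vocabulary.
probs = np.array([
    [0.95, 0.02, 0.01, 0.01, 0.01],   # confident
    [0.92, 0.03, 0.02, 0.02, 0.01],   # confident
    [0.40, 0.30, 0.15, 0.10, 0.05],   # uncertain -> stop here
    [0.97, 0.01, 0.01, 0.005, 0.005],
])
n = decode_step(probs, threshold=0.9)
print(n)  # accepts the first 2 tokens
```

Accepting variable-length prefixes per step is what lets throughput scale with model confidence: easy spans decode many tokens per pass, hard spans fall back toward one-token-per-pass ARM behavior.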
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 53.29 | 1460 |
| Language Modeling | PTB | Perplexity | 97.74 | 650 |
| Commonsense Reasoning | PIQA | Accuracy | 71.71 | 647 |
| Language Modeling | WikiText | Perplexity | 38.67 | 479 |
| Language Modeling | LAMBADA | Perplexity | 30.36 | 99 |
| Language Modeling | OpenWebText | Perplexity | 17.59 | 50 |
| Coreference Resolution | WinoGrande | Accuracy | 53.28 | 36 |
| Language Modeling | arXiv | Perplexity | 20.34 | 21 |
| Language Modeling | AG-News | Perplexity | 27.67 | 20 |
| Language Modeling | PubMed | Perplexity | 13.2 | 8 |