
Causal Autoregressive Diffusion Language Model

About

In this work, we propose Causal Autoregressive Diffusion (CARD), a framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process under a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking scheme that preserves local context, together with a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, in which the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
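The confidence-based, variable-length decoding step described in the abstract can be sketched roughly as follows. This is an illustrative assumption of how such a step might work, not the paper's actual algorithm: the function name, the fixed threshold, and the greedy fallback are all hypothetical.

```python
# Hypothetical sketch of confidence-based dynamic parallel decoding:
# at each step the model proposes a block of future-token distributions,
# and the decoder commits only the prefix whose per-token confidence
# (max probability) clears a threshold. The threshold value and the
# greedy one-token fallback are illustrative assumptions.
import numpy as np

def commit_prefix(probs: np.ndarray, threshold: float = 0.9) -> list[int]:
    """probs: (block_len, vocab_size) per-position predictive distributions.

    Returns the variable-length prefix of argmax tokens whose confidence
    exceeds `threshold`. Always commits at least one token so decoding
    makes progress (falling back to a single greedy step).
    """
    committed = []
    for dist in probs:
        token = int(dist.argmax())
        if dist.max() >= threshold or not committed:
            committed.append(token)
            if dist.max() < threshold:  # forced fallback token: stop here
                break
        else:
            break  # confidence dropped; defer the rest to the next step
    return committed
```

Under a causal attention mask, each committed token would simply extend the KV cache, so the next parallel step conditions on everything accepted so far without recomputation.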

Junhao Ruan, Bei Li, Yongjing Yin, Pengcheng Huang, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, JingBo Zhu• 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 53.29 | 1891 |
| Language Modeling | PTB | Perplexity | 97.74 | 1034 |
| Commonsense Reasoning | PIQA | Accuracy | 71.71 | 751 |
| Language Modeling | WikiText | Perplexity | 38.67 | 732 |
| Language Modeling | LAMBADA | Perplexity | 30.36 | 150 |
| Language Modeling | OpenWebText | Perplexity | 17.59 | 91 |
| Language Modeling | arXiv | Perplexity | 20.34 | 55 |
| Coreference Resolution | WinoGrande | Accuracy | 53.28 | 40 |
| Language Modeling | Pubmed | Perplexity | 13.2 | 38 |
| Language Modeling | AG-News | Perplexity | 27.67 | 36 |
Showing 10 of 15 rows
