Causal Autoregressive Diffusion Language Model
About
In this work, we propose Causal Autoregressive Diffusion (CARD), a framework that unifies the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. CARD reformulates the diffusion process under a strictly causal attention mask, enabling dense per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking scheme that preserves local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, in which the model leverages KV caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3$\times$ compared to block diffusion methods. Our results show that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for the next generation of efficient LLMs.
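The confidence-based dynamic parallel decoding described above can be sketched as follows. This is a hypothetical illustration, not the paper's exact procedure: the function name, the block size, and the top-1-probability acceptance rule with a fixed threshold are all assumptions. The idea is that one forward pass scores a block of candidate positions, and the longest prefix whose per-token confidence clears the threshold is committed; in a real implementation the KV cache would then be advanced by that many positions.

```python
# Hedged sketch of confidence-based parallel decoding (assumed acceptance rule:
# top-1 probability above a fixed threshold; not taken from the paper).
import numpy as np

def decode_step(probs: np.ndarray, threshold: float = 0.9) -> int:
    """Return how many parallel-proposed tokens to accept.

    probs: (block_size, vocab_size) next-token distributions for each
    candidate position, all produced in a single forward pass.
    """
    confidences = probs.max(axis=-1)   # top-1 probability per position
    accepted = 0
    for c in confidences:              # accept the longest confident prefix
        if c < threshold:
            break
        accepted += 1
    return max(accepted, 1)            # always commit at least one token

# Toy example: 4 candidate positions over a 5-token vocabulary.
probs = np.array([
    [0.95, 0.02, 0.01, 0.01, 0.01],   # confident
    [0.92, 0.03, 0.02, 0.02, 0.01],   # confident
    [0.40, 0.30, 0.15, 0.10, 0.05],   # uncertain -> stop here
    [0.97, 0.01, 0.01, 0.005, 0.005],
])
n = decode_step(probs, threshold=0.9)
print(n)  # accepts the first 2 tokens
```

Accepting variable-length prefixes per step is what lets throughput scale with model confidence: easy spans decode many tokens per pass, hard spans fall back toward one-token-per-pass ARM behavior.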
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 53.29 | 1460 |
| Language Modeling | PTB | Perplexity | 97.74 | 650 |
| Commonsense Reasoning | PIQA | Accuracy | 71.71 | 647 |
| Language Modeling | WikiText | Perplexity | 38.67 | 479 |
| Language Modeling | LAMBADA | Perplexity | 30.36 | 99 |
| Language Modeling | OpenWebText | Perplexity | 17.59 | 50 |
| Coreference Resolution | WinoGrande | Accuracy | 53.28 | 36 |
| Language Modeling | arXiv | Perplexity | 20.34 | 21 |
| Language Modeling | AG-News | Perplexity | 27.67 | 20 |
| Language Modeling | PubMed | Perplexity | 13.2 | 8 |