
Causal Autoregressive Diffusion Language Model

About

In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema that preserves local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
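The confidence-based dynamic parallel decoding described above can be illustrated with a minimal sketch: at each step the model proposes several candidate tokens in parallel, and the longest confident prefix is committed so the causal KV-cache remains valid. The function name, threshold value, and acceptance rule here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def parallel_decode_step(logits, threshold=0.9):
    """Hypothetical sketch of confidence-based parallel decoding.

    `logits` has shape (n_candidates, vocab_size). All leading positions
    whose max softmax probability exceeds `threshold` are accepted in one
    step; the rest are deferred to the next forward pass. The threshold
    and greedy acceptance rule are illustrative, not from the paper.
    """
    # Softmax over the vocabulary at each candidate position.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    conf = probs.max(axis=-1)        # per-position confidence
    tokens = probs.argmax(axis=-1)   # greedy candidate tokens
    # Accept the longest confident *prefix*: stopping at the first
    # low-confidence position keeps the committed sequence strictly
    # causal, so the KV-cache for accepted tokens stays reusable.
    n_accept = 0
    for c in conf:
        if c >= threshold:
            n_accept += 1
        else:
            break
    return tokens[: max(n_accept, 1)]  # always commit at least one token
```

In this sketch the step length varies with model confidence: confident stretches of text are emitted several tokens at a time, while uncertain positions fall back to one-token-per-step autoregressive decoding.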

Junhao Ruan, Bei Li, Yongjing Yin, Pengcheng Huang, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, JingBo Zhu • 2026

Related benchmarks

Task                  | Dataset     | Metric     | Result | Rank
----------------------|-------------|------------|--------|-----
Commonsense Reasoning | HellaSwag   | Accuracy   | 53.29  | 1460
Language Modeling     | PTB         | Perplexity | 97.74  | 650
Commonsense Reasoning | PIQA        | Accuracy   | 71.71  | 647
Language Modeling     | WikiText    | Perplexity | 38.67  | 479
Language Modeling     | LAMBADA     | Perplexity | 30.36  | 99
Language Modeling     | OpenWebText | Perplexity | 17.59  | 50
Coreference Resolution| WinoGrande  | Accuracy   | 53.28  | 36
Language Modeling     | arXiv       | Perplexity | 20.34  | 21
Language Modeling     | AG-News     | Perplexity | 27.67  | 20
Language Modeling     | Pubmed      | Perplexity | 13.2   | 8

(Showing 10 of 15 rows)
