SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding

About

Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
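
The three components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is one plausible reading of the abstract, and every function name, the linear mean schedule, the default values (`mean_start`, `mean_end`, `concentration`), and the exact normalization are assumptions made for illustration only. Each block draws its own mask ratio from a Beta distribution, the Beta mean rises with the training step, and each block's loss is divided by the mask ratio actually realized rather than the ratio that was sampled.

```python
import random


def beta_params(step, total_steps, mean_start=0.5, mean_end=0.8, concentration=4.0):
    """Hypothetical curriculum: move the Beta mean upward linearly over training.

    The concentration (alpha + beta) is held fixed so the distribution keeps
    some spread while its mean, and hence the expected mask coverage, grows.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    mean = mean_start + frac * (mean_end - mean_start)
    return mean * concentration, (1.0 - mean) * concentration


def mask_blocks(num_blocks, block_size, step, total_steps, rng):
    """Draw an independent mask ratio per block, then mask tokens at that ratio."""
    a, b = beta_params(step, total_steps)
    masks = []
    for _ in range(num_blocks):
        t = rng.betavariate(a, b)  # assumed: one noise level per block
        masks.append([rng.random() < t for _ in range(block_size)])
    return masks


def scaled_loss(token_losses, masks):
    """Assumed normalization: divide each block's masked-token loss by the
    realized mask ratio of that block, then average over all tokens."""
    total = 0.0
    n_tokens = 0
    for losses, mask in zip(token_losses, masks):
        n_tokens += len(mask)
        masked = sum(mask)
        if masked == 0:
            continue  # a block with no masked tokens contributes nothing
        effective_ratio = masked / len(mask)
        total += sum(l for l, m in zip(losses, mask) if m) / effective_ratio
    return total / n_tokens
```

In this sketch, dividing by the realized ratio means a block contributes the same amount whether few or many of its tokens happened to be masked, provided the per-token losses are equal. Whether this matches the paper's exact formulation should be checked against the paper itself.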

Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou • 2025

Related benchmarks

Task	Dataset	Result	
Video Understanding	VideoMME	--	222
Multimodal Reasoning	MMMU-Pro	Accuracy 28.2	171
Multimodal Reasoning	MMMU (val)	Accuracy 44	168
Visual Question Answering	MMStar	Accuracy 53.3	151
Document Question Answering	DocVQA (test)	Accuracy 88.3	92
Video Understanding	MLVU	--	80
Long Video Understanding	MLVU (dev)	--	63
Multimodal Understanding	MMBench en (dev)	Score 82.2	38
Document Understanding	AI2D	Accuracy 0.796	28
General Visual Question Answering	RealworldQA	Accuracy 66.1	25

Showing 10 of 18 rows
