DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

About

Diffusion-based decoding has recently emerged as an appealing alternative to autoregressive (AR) generation, offering the potential to update multiple tokens in parallel and reduce latency. However, diffusion vision language models (dVLMs) still lag significantly behind mainstream autoregressive vision language models, because base diffusion language models (dLLMs) are scarcer and weaker than their autoregressive counterparts. This raises a natural question: can we build high-performing dVLMs directly from existing powerful AR models, without relying on dLLMs? We propose DiffusionVL, a family of dVLMs obtained by translating pretrained AR models into the diffusion paradigm via an efficient diffusion finetuning procedure that changes the training objective and decoding process while keeping the backbone architecture intact. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance comparable to that of the same AR model finetuned with standard autoregressive visual instruction tuning. To enable practical open-ended generation, we further integrate block decoding, which supports arbitrary-length outputs and KV-cache reuse for faster inference. Our experiments demonstrate that, despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
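The block decoding idea mentioned in the abstract can be illustrated with a toy sketch. This is not the DiffusionVL implementation: the names `MASK`, `toy_denoiser`, and `block_decode`, and the confidence-based unmasking rule, are all assumptions made for illustration. The sketch only shows the general shape of the technique: output is produced one fixed-size block at a time, several tokens inside a block are filled in per step, and finished blocks are frozen so that a real model could reuse their cached keys and values.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id for a not-yet-decoded position


def toy_denoiser(context: List[int], block: List[int],
                 vocab_size: int = 50) -> List[Tuple[int, float]]:
    """Hypothetical stand-in for a model forward pass.

    Returns a (token, confidence) guess for every position in the block.
    A real dVLM would attend over `context` (ideally via a KV cache) and
    the partially masked block; here the guess depends only on position.
    """
    preds = []
    for i in range(len(block)):
        pos = len(context) + i
        token = (pos * 7 + 3) % vocab_size
        confidence = ((pos * 13) % 10 + 1) / 10.0
        preds.append((token, confidence))
    return preds


def block_decode(prompt: List[int], num_blocks: int, block_size: int,
                 steps_per_block: int,
                 denoiser: Callable = toy_denoiser) -> List[int]:
    """Generate num_blocks * block_size tokens, one block at a time."""
    context = list(prompt)  # finalized tokens; never revisited
    # Number of positions revealed per denoising step (ceiling division).
    per_step = max(1, -(-block_size // steps_per_block))
    for _ in range(num_blocks):
        block = [MASK] * block_size
        while MASK in block:
            preds = denoiser(context, block)
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            # Reveal the most confident masked positions in parallel.
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:per_step]:
                block[i] = preds[i][0]
        # The block is final; a real model would append its keys/values
        # to the cache here instead of recomputing them later.
        context.extend(block)
    return context[len(prompt):]


generated = block_decode([1, 2, 3], num_blocks=2, block_size=4,
                         steps_per_block=2)
print(len(generated), generated)
```

Because the number of blocks is a loop bound rather than a fixed canvas size, output length is not tied to a preset sequence length, which is the property the abstract attributes to block decoding. In this sketch each block of 4 tokens takes 2 denoising steps instead of 4 sequential AR steps.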

Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMMU (val)	--	199
Chart Question Answering	ChartQA (test)	Accuracy 84.2	190
Multimodal Understanding	SEED-Bench Image	Accuracy 75.5	143
Diagram Question Answering	AI2D (test)	Accuracy 82.2	142
Multimodal Understanding	MME Perception	--	59
Visual Question Answering	RealWorldQA (test)	Accuracy 68	47
Multimodal Understanding	MME Cognition	Score 675	45
Multimodal Understanding	MMBench en (dev)	Score 83.5	38
Multimodal Understanding	MMStar (test)	Accuracy 63.2	26
Multi-image Reasoning	Muirbench (test)	Accuracy 47.2	24

Showing 10 of 22 rows
