BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

About

Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq$ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3$\times$ decoding throughput speedup compared to the source model. Code is available at https://github.com/fudan-generative-vision/Bard-VL.
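As a rough illustration of the staged training described in the abstract, the sketch below lays out one possible stage plan: the block size grows step by step, and every enlarged stage is distilled from the same fixed small-block anchor. The doubling schedule, the example block sizes (4 and 32), the supervised-only first stage, and the names `Stage`, `plan_stages`, and `block_ids` are all assumptions made for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    block_size: int               # decoding block size trained at this stage
    teacher_block: Optional[int]  # block size of the distillation teacher; None = supervised only


def plan_stages(anchor_block: int, target_block: int) -> List[Stage]:
    """Build a hypothetical stage plan from the anchor block size to the target.

    Assumption: the block size doubles each stage (capped at the target),
    and every stage after the first distills from the fixed anchor.
    """
    if anchor_block < 1 or target_block < anchor_block:
        raise ValueError("need 1 <= anchor_block <= target_block")
    stages = [Stage(block_size=anchor_block, teacher_block=None)]
    size = anchor_block
    while size < target_block:
        size = min(size * 2, target_block)
        # The teacher stays at the anchor size; it does not follow the student.
        stages.append(Stage(block_size=size, teacher_block=anchor_block))
    return stages


def block_ids(seq_len: int, block_size: int) -> List[int]:
    """Assign each token position to a decoding block of the given size."""
    return [pos // block_size for pos in range(seq_len)]


if __name__ == "__main__":
    for stage in plan_stages(anchor_block=4, target_block=32):
        print(stage)
    print(block_ids(seq_len=6, block_size=4))
```

Keeping the teacher fixed at the anchor size mirrors the abstract's point that distillation within the diffusion regime is the effective direction, as opposed to distilling directly from the autoregressive source model.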

Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Liwei Zhang, Yuxuan Yao, Weihao Yuan, Siyu Zhu • 2026

Related benchmarks

Task	Dataset	Metric	Result
Multimodal Reasoning	MMMU-Pro	Accuracy	37.6	171
Multimodal Reasoning	MMMU (val)	Accuracy	54.6	168
Visual Question Answering	MMStar	Accuracy	65	151
Document Understanding	AI2D	Accuracy	0.832	28
General Visual Question Answering	RealworldQA	Accuracy	71.9	25
Document and Chart Understanding	ChartQA	Accuracy	84.6	19
Multimodal Reasoning	MME	Sum Score	2.39e+3	13
