Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

About

Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps. Existing confidence-based decoding ranks masked positions independently and commits the top-K positions, largely ignoring whether the committed tokens provide complementary visual grounding. We identify a step-level limitation of this strategy in multimodal settings: high-confidence tokens selected in the same step can rely on overlapping visual grounding, introducing visual redundancy among the committed tokens and leaving less complementary visual grounding available for later decoding. To quantify this effect, we introduce the Visual Redundancy Index (VRI), which measures visual grounding overlap among tokens committed in parallel. To control this redundancy during decoding, we propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that uses token-to-image attention to prioritize visually complementary positions. Across diverse multimodal benchmarks, VRCD reduces visual redundancy and remaining-position entropy with modest runtime overhead. In longer decoding experiments, it also achieves relative accuracy gains of up to 18.8% on M^3CoT and 6.9% on MMBench over confidence-based decoding. Code is available at https://github.com/infiniteYuanyl/VRCD.

Yulin Yuan, Hongshuo Zhao, Xiangming Meng • 2026
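The contrast the abstract draws, between ranking masked positions independently by confidence and also accounting for shared visual grounding, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not give the VRI formula or the VRCD selection rule, so the cosine-similarity overlap proxy, the greedy penalized selection, the `penalty` parameter, and the toy numbers are hypothetical stand-ins, not the authors' method.

```python
import math


def top_k_by_confidence(confidence, k):
    """Baseline from the abstract: rank positions independently, commit the top k."""
    order = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    return order[:k]


def cosine(u, v):
    """Cosine similarity between two token-to-image attention vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def mean_pairwise_overlap(attn, selected):
    """Hypothetical redundancy proxy (NOT the paper's VRI definition):
    mean pairwise cosine similarity of attention over the committed positions."""
    pairs = [(i, j) for idx, i in enumerate(selected) for j in selected[idx + 1:]]
    if not pairs:
        return 0.0
    return sum(cosine(attn[i], attn[j]) for i, j in pairs) / len(pairs)


def select_complementary(confidence, attn, k, penalty=0.5):
    """Hypothetical greedy rule (NOT the paper's VRCD algorithm): pick positions
    one at a time, discounting confidence by the largest attention overlap with
    any position already chosen in this step."""
    selected = []
    remaining = set(range(len(confidence)))
    while remaining and len(selected) < k:
        def score(i):
            overlap = max((cosine(attn[i], attn[j]) for j in selected), default=0.0)
            return confidence[i] - penalty * overlap

        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: 4 masked positions, attention over 3 image patches (made-up values).
confidence = [0.95, 0.93, 0.80, 0.60]
attn = [
    [0.90, 0.05, 0.05],  # position 0 attends mostly to patch 0
    [0.85, 0.10, 0.05],  # position 1 also attends mostly to patch 0
    [0.05, 0.90, 0.05],  # position 2 attends mostly to patch 1
    [0.10, 0.10, 0.80],  # position 3 attends mostly to patch 2
]

baseline = top_k_by_confidence(confidence, k=2)            # [0, 1]
complementary = select_complementary(confidence, attn, 2)  # [0, 2]
print(baseline, mean_pairwise_overlap(attn, baseline))
print(complementary, mean_pairwise_overlap(attn, complementary))
```

In this toy case the two most confident positions attend to the same patch, so the confidence-only baseline commits a highly overlapping pair, while the penalized rule trades a little confidence for a position grounded in a different patch.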

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMBench	Accuracy 69.87	887
Visual Question Answering	InfoVQA	Accuracy 32.94	264
Information Visual Question Answering	InfoVQA	Accuracy 30.37	159
Infographic Question Answering	InfoVQA	ANLS 14.2	117
Multi-modal Reasoning	M3CoT	Accuracy 41.99	90
Vision-Language Understanding	MMBench	Accuracy 54.84	88
Multi-modal Question Answering	MMBench	Accuracy 69.87	84
Visual Question Answering	M3CoT	Accuracy 41.99	71
Document Question Answering	DocVQA	ANLS 12.38	64
Document Visual Question Answering	DocVQA	Accuracy 54.33	54

Showing 10 of 12 rows
