
Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

About

Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps. Existing confidence-based decoding ranks masked positions independently and commits the top-K positions, largely ignoring whether the committed tokens provide complementary visual grounding. We identify a step-level limitation of this strategy in multimodal settings: high-confidence tokens selected in the same step can rely on overlapping visual grounding, introducing visual redundancy among the committed tokens and leaving less complementary visual grounding available for later decoding. To quantify this effect, we introduce the Visual Redundancy Index (VRI), which measures visual grounding overlap among tokens committed in parallel. To control this redundancy during decoding, we propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that uses token-to-image attention to prioritize visually complementary positions. Across diverse multimodal benchmarks, VRCD reduces visual redundancy and remaining-position entropy with modest runtime overhead. In longer decoding experiments, it also achieves relative accuracy gains of up to 18.8% on M^3CoT and 6.9% on MMBench over confidence-based decoding. Code will be released at https://github.com/infiniteYuanyl/VRCD.
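The contrast between confidence-only top-K selection and redundancy-aware selection can be illustrated with a small sketch. The abstract does not give the exact definitions of VRI or VRCD, so everything below is an assumption made for illustration: redundancy is approximated as the mean pairwise cosine similarity between token-to-image attention rows, and the selection rule is a simple greedy confidence-minus-overlap score with a hypothetical weight `lam`. It is not the authors' implementation.

```python
import numpy as np


def visual_redundancy(attn):
    """Illustrative redundancy score (an assumption, not the paper's VRI):
    mean pairwise cosine similarity between token-to-image attention rows."""
    k = attn.shape[0]
    if k < 2:
        return 0.0
    unit = attn / np.linalg.norm(attn, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Average over off-diagonal entries only (ignore self-similarity).
    return float(sim[~np.eye(k, dtype=bool)].mean())


def select_topk_confidence(conf, k):
    """Baseline: commit the k masked positions with the highest confidence."""
    return sorted(np.argsort(-conf)[:k].tolist())


def select_redundancy_controlled(conf, attn, k, lam=0.5):
    """Hypothetical greedy rule: start from the most confident position, then
    repeatedly add the position maximizing confidence minus lam times its
    largest attention overlap with the positions already chosen."""
    unit = attn / np.linalg.norm(attn, axis=1, keepdims=True)
    chosen = [int(np.argmax(conf))]
    while len(chosen) < k:
        overlap = (unit @ unit[chosen].T).max(axis=1)
        score = conf - lam * overlap
        score[chosen] = -np.inf  # never pick the same position twice
        chosen.append(int(np.argmax(score)))
    return sorted(chosen)


# Toy data: 4 masked positions, 3 image patches (rows are attention over patches).
conf = np.array([0.95, 0.93, 0.90, 0.60])
attn = np.array([
    [0.90, 0.05, 0.05],  # position 0 attends mostly to patch 0
    [0.88, 0.07, 0.05],  # position 1 attends to the same patch
    [0.05, 0.90, 0.05],  # position 2 attends to patch 1
    [0.05, 0.05, 0.90],  # position 3 attends to patch 2
])

baseline = select_topk_confidence(conf, k=2)
controlled = select_redundancy_controlled(conf, attn, k=2)
print(baseline, visual_redundancy(attn[baseline]))
print(controlled, visual_redundancy(attn[controlled]))
```

In this toy case the baseline commits positions 0 and 1, which attend to the same image patch, while the greedy rule swaps position 1 for position 2 at a small cost in confidence. The weight `lam` controls that trade-off; setting it to 0 recovers the confidence-only ranking.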

Yulin Yuan, Hongshuo Zhao, Xiangming Meng • 2026

Related benchmarks

Task                                   Dataset   Metric     Result   Rank
Multimodal Understanding               MMBench   Accuracy   69.87    847
Visual Question Answering              InfoVQA   Accuracy   32.94    195
Infographic Question Answering         InfoVQA   ANLS       14.2     117
Information Visual Question Answering  InfoVQA   Accuracy   30.37    110
Multi-modal Reasoning                  M3CoT     Accuracy   41.99    90
Multi-modal Question Answering         MMBench   Accuracy   69.87    84
Visual Question Answering              M3CoT     Accuracy   41.99    71
Vision-Language Understanding          MMBench   Accuracy   54.84    64
Document Question Answering            DocVQA    ANLS       12.38    64
Document Visual Question Answering     DocVQA    Accuracy   54.33    43
Showing 10 of 12 rows
