Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

About

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
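The abstract does not spell out how Mask Prior Suppression is formulated. As a rough illustration of the drift it describes, the toy sketch below shows how a shared direction inflates the similarity between hidden states, and how projecting that direction out restores their differences. This is not the authors' method: the projection, the function names (`suppress_direction`, `cosine`), the `strength` parameter, and the toy vectors are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / n for a in v]

def cosine(u, v):
    return dot(normalize(u), normalize(v))

def suppress_direction(hidden, prior, strength=1.0):
    """Subtract `strength` times the component of `hidden` along `prior`.

    Hypothetical helper: a plain vector projection, assumed here only to
    illustrate the idea of damping a shared prior direction.
    """
    p = normalize(prior)
    coeff = dot(hidden, p)
    return [h - strength * coeff * p_i for h, p_i in zip(hidden, p)]

# Toy data (assumed): two hidden states dominated by a shared prior direction.
prior = [1.0, 0.0, 0.0]
h_a = [3.0, 1.0, 0.0]
h_b = [3.0, 0.0, 1.0]

# High similarity while the shared component is present.
before = cosine(h_a, h_b)

# Similarity after the shared component is projected out of both states.
after = cosine(suppress_direction(h_a, prior), suppress_direction(h_b, prior))

print(f"cosine before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the two states look nearly identical only because of the shared component. Once it is removed, the token-specific parts that remain are orthogonal, which is the kind of separation a suppression step would aim to recover.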

Sujung Hong, Chanyong Yoon, Seong Jae Hwang • 2026

Related benchmarks

Task	Dataset	Result
Multi-discipline Multimodal Understanding	MMMU	--	422
Multimodal Reasoning	MMMU	Accuracy 49.3	220
Multimodal Reasoning	MMBench	Accuracy 83.3	180
Multimodal Understanding	MMB	--	63
Visual Grounding	RefCOCOg	Accuracy 65	52
Multimodal Perception	MME-P	MME-P Score 1.55e+3	35
Visual Grounding	Ferret 100 sampled instances	Accuracy 62.9	8
Long-form generation	DetailCaps 100 sampled instances	Score 63.6	8
Multimodal Reasoning	MME (test)	Sum Score 2.00e+3	8
Long-form generation	LLaVA-Bench	Score 64.1	8

Showing 10 of 15 rows
