
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

About

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
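As a rough illustration of the first failure mode, the toy sketch below measures how strongly a set of hidden states aligns with a shared direction, then removes the component along that direction. It is not the authors' Mask Prior Suppression: the abstract does not specify the method, so `prior_alignment`, `suppress_direction`, and `alpha` are hypothetical names, and the "hidden states" are random vectors rather than model activations.

```python
import math
import random


def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prior_alignment(hidden, prior_dir):
    """Mean cosine similarity between each hidden state and prior_dir."""
    u = unit(prior_dir)
    return sum(dot(unit(h), u) for h in hidden) / len(hidden)


def suppress_direction(hidden, prior_dir, alpha=1.0):
    """Subtract alpha times each state's component along prior_dir.

    Hypothetical helper: alpha=1.0 is a full orthogonal projection,
    smaller values only dampen the shared component.
    """
    u = unit(prior_dir)
    return [[x - alpha * dot(h, u) * ui for x, ui in zip(h, u)] for h in hidden]


# Toy data (assumption): eight states pulled toward one shared direction.
random.seed(0)
dim = 16
prior = [random.gauss(0, 1) for _ in range(dim)]
hidden = [[random.gauss(0, 1) + 3.0 * p for p in prior] for _ in range(8)]

before = prior_alignment(hidden, prior)
after = prior_alignment(suppress_direction(hidden, prior), prior)
print(f"alignment before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the alignment is high before suppression and close to zero after a full projection; how the paper estimates the prior direction and how strongly it suppresses it are details the abstract leaves open.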

Sujung Hong, Chanyong Yoon, Seong Jae Hwang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-discipline Multimodal Understanding | MMMU | -- | 363 |
| Multimodal Reasoning | MMMU | Accuracy: 49.3 | 208 |
| Multimodal Reasoning | MMBench | Accuracy: 83.3 | 127 |
| Multimodal Understanding | MMB | -- | 53 |
| Visual Grounding | RefCOCOg | Accuracy: 65 | 45 |
| Multimodal Perception | MME-P | MME-P Score: 1.55e+3 | 25 |
| Visual Grounding | Ferret (100 sampled instances) | Accuracy: 62.9 | 8 |
| Long-form generation | DetailCaps (100 sampled instances) | Score: 63.6 | 8 |
| Multimodal Reasoning | MME (test) | Sum Score: 2.00e+3 | 8 |
| Long-form generation | LLaVA-Bench | Score: 64.1 | 8 |
Showing 10 of 15 rows
