Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

About

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
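The dilution argument in the abstract can be illustrated with a toy softmax calculation. The sketch below is not the paper's analysis. It assumes a single attention query, a hypothetical count of 256 visual tokens, and equal logits for every token, so that the only thing changing is the size of the softmax normalizer (the partition function). Under those assumptions the attention mass on visual tokens is N_v / (N_v + T), which shrinks roughly as 1/T once the text length T dominates.

```python
import math

def visual_attention_mass(visual_logits, text_logits):
    """Fraction of softmax attention mass that lands on visual tokens."""
    # Subtract the max logit for numerical stability (standard softmax trick).
    m = max(visual_logits + text_logits)
    vis = sum(math.exp(x - m) for x in visual_logits)
    txt = sum(math.exp(x - m) for x in text_logits)
    # vis + txt is the partition function; it grows with every generated token.
    return vis / (vis + txt)

NUM_VISUAL = 256  # hypothetical number of visual tokens (assumption)

for num_text in (0, 256, 1024, 4096):
    # Equal logits everywhere: an illustrative simplification, not real model behavior.
    mass = visual_attention_mass([0.0] * NUM_VISUAL, [0.0] * num_text)
    print(f"text tokens = {num_text:5d}  visual attention mass = {mass:.4f}")
```

In a real model the logits are not uniform, so the decay need not follow this exact curve. The sketch only shows why a fixed set of visual tokens loses share as the textual history grows, which is the effect a distance-agnostic retrieval pathway such as PVM is meant to sidestep.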

Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng • 2026

Related benchmarks

Task                      Dataset          Metric    Result  Rank
Mathematics Reasoning     MathVision Mini  Accuracy  51.3    35
Mathematical Reasoning    MathVerse-V      Accuracy  59.8    28
Multimodal Understanding  MMStar (test)    Accuracy  71.6    26
Multimodal Understanding  MMMU (dev)       Accuracy  67.3    25
Diagram Understanding     AI2D lite        Accuracy  82.8    20
Multimodal Understanding  MMBench-CN lite  Accuracy  91.2    20
Multimodal Understanding  MMBench-EN lite  Accuracy  89.4    20
Multimodal Understanding  MMT emo          Accuracy  58.3    20
