
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

About

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by aligning pretrained visual representations with the linguistic knowledge embedded in Large Language Models (LLMs). However, existing approaches typically rely on final-layer visual features or learnable multi-layer fusion, which, lacking explicit cross-layer interaction, often fail to fully exploit hierarchical visual cues. In this work, we propose a Memory-Augmented Adapter (Mema) within the vision encoder. Specifically, Mema maintains a stateful memory that accumulates hierarchical visual representations across layers, with its evolution conditioned on both query embeddings and step-wise visual features. A portion of this memory is selectively injected into token representations via a feedback mechanism, thereby mitigating the attenuation of fine-grained visual cues from shallow layers. Designed as a lightweight, plug-and-play module, Mema integrates seamlessly into pretrained vision encoders without modifying the vanilla backbone architecture. Only a minimal set of additional parameters requires training, enabling adaptive visual feature refinement while reducing training overhead. Extensive experiments across multiple benchmarks demonstrate that Mema consistently improves performance, validating its effectiveness in complex multimodal reasoning tasks. The code has been released at https://github.com/Sisiliu312/Mema.
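The gated memory update and feedback injection described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's actual formulation: the mean-pooled query, the scalar gate, and all shapes are assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mema_layer_step(tokens, memory, params):
    """One hypothetical adapter step at a given encoder layer.

    tokens : (T, d) visual token features from the current layer
    memory : (d,)   stateful memory accumulated over earlier layers
    """
    W_u, W_g, W_o = params
    # Summarize the current layer's features (mean-pooled query; an assumption).
    query = tokens.mean(axis=0)
    # Gated accumulation: the memory evolves conditioned on both the
    # running memory and the step-wise visual features.
    gate = sigmoid(W_g @ np.concatenate([memory, query]))
    memory = gate * memory + (1.0 - gate) * np.tanh(W_u @ query)
    # Feedback: inject a projection of the memory back into every token,
    # re-exposing cues accumulated from shallower layers.
    tokens = tokens + np.tanh(W_o @ memory)
    return tokens, memory

d, T, L = 16, 8, 4
params = (rng.normal(0, 0.1, (d, d)),    # W_u: memory update projection
          rng.normal(0, 0.1, (2 * d,)),  # W_g: scalar gate over [memory; query]
          rng.normal(0, 0.1, (d, d)))    # W_o: feedback projection

tokens = rng.normal(size=(T, d))
memory = np.zeros(d)
for _ in range(L):  # run the adapter across L encoder layers
    tokens, memory = mema_layer_step(tokens, memory, params)
```

Only the three small projection matrices would be trained in such a design, which matches the abstract's claim that the frozen backbone is untouched and the added parameter count stays minimal.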

Ying Liu, Yudong Han, Kean Shi, Liyuan Pan • 2026

Related benchmarks

Task                        Dataset         Metric     Result   Rank
Multimodal Evaluation       MME             --         --       658
Visual Question Answering   ScienceQA       Accuracy   70.1     370
Hallucination Evaluation    POPE            Accuracy   86.7     153
Visual Question Answering   MMBench (MMB)   Accuracy   64.7     76
Visual Question Answering   DocVQA          ANLS       21       38
