Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

About

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by aligning pretrained visual representations with the linguistic knowledge embedded in Large Language Models (LLMs). However, existing approaches typically rely on final-layer visual features or on learnable multi-layer fusion, and without an explicit design for cross-layer interaction they often fail to fully exploit hierarchical visual cues. In this work, we propose a Memory-Augmented Adapter (Mema) that operates within the vision encoder. Specifically, Mema maintains a stateful memory that accumulates hierarchical visual representations across layers, with its evolution conditioned on both query embeddings and step-wise visual features. A portion of this memory is selectively injected into the token representations via a feedback mechanism, mitigating the attenuation of fine-grained visual cues from shallow layers. Designed as a lightweight, plug-and-play module, Mema integrates into pretrained vision encoders without modifying the vanilla backbone architecture. Only a small set of additional parameters requires training, which enables adaptive visual feature refinement while reducing training overhead. Extensive experiments across multiple benchmarks demonstrate that Mema consistently improves performance, validating its effectiveness on complex multimodal reasoning tasks. The code has been released at https://github.com/Sisiliu312/Mema.
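The abstract describes the mechanism only at a high level, so the following is a minimal NumPy sketch of one way such a memory could work, not the released implementation. Everything in it is an assumption made for illustration: the class name `MemoryAdapterSketch`, the attention-based read, the sigmoid gate, the choice of "the first k slots" as the injected portion, the fixed `feedback_scale`, and the toy `frozen_layer` standing in for a pretrained encoder block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head dot-product attention where keys and values coincide."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

class MemoryAdapterSketch:
    """Hypothetical stateful memory shared across encoder layers."""

    def __init__(self, num_slots, dim, num_feedback_slots, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed trainable parameters: query embeddings and a gate projection.
        self.queries = rng.normal(scale=0.02, size=(num_slots, dim))
        self.gate_w = rng.normal(scale=0.02, size=(2 * dim, dim))
        self.num_feedback_slots = num_feedback_slots
        self.feedback_scale = 0.1  # assumed fixed injection strength
        self.memory = np.zeros((num_slots, dim))

    def reset(self):
        self.memory = np.zeros_like(self.memory)

    def step(self, tokens):
        # Update: memory slots (plus query embeddings) read from this layer's tokens.
        read = attend(self.memory + self.queries, tokens)
        gate_in = np.concatenate([self.memory, read], axis=-1)
        gate = 1.0 / (1.0 + np.exp(-(gate_in @ self.gate_w)))
        self.memory = (1.0 - gate) * self.memory + gate * read
        # Feedback: tokens attend to a portion of the memory, added residually.
        portion = self.memory[: self.num_feedback_slots]
        return tokens + self.feedback_scale * attend(tokens, portion)

def frozen_layer(tokens, weight):
    """Toy stand-in for one frozen backbone layer (not a real ViT block)."""
    return tokens + np.tanh(tokens @ weight)

def encode(tokens, layer_weights, adapter):
    """Run the frozen layers, updating and injecting memory after each one."""
    adapter.reset()
    for weight in layer_weights:
        tokens = frozen_layer(tokens, weight)
        tokens = adapter.step(tokens)
    return tokens, adapter.memory
```

In this sketch the backbone weights are never touched; only `queries` and `gate_w` would be trained, which mirrors the abstract's claim of a small trainable parameter set. Setting `feedback_scale` to zero recovers the plain backbone output, which makes the adapter's contribution easy to ablate.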

Ying Liu, Yudong Han, Kean Shi, Liyuan Pan • 2026

Related benchmarks

Task	Dataset	Result
Multimodal Evaluation	MME	--	727
Visual Question Answering	ScienceQA	Accuracy 70.1	446
Hallucination Evaluation	POPE	Accuracy 86.7	217
Visual Question Answering	MMBench (MMB)	Accuracy 64.7	86
Visual Question Answering	DocVQA	ANLS 21	59

