
Visual Representation Alignment for Multimodal Large Language Models

About

Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting and spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to incorporate complementary visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
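The core idea of aligning internal MLLM representations with frozen VFM features can be sketched as an auxiliary loss added to the usual instruction-tuning objective. The sketch below is illustrative, not the paper's exact formulation: the projection head, the cosine-similarity loss, and all tensor shapes are assumptions for the sake of a minimal, runnable example.

```python
import torch
import torch.nn.functional as F

def visual_alignment_loss(mllm_visual_hidden: torch.Tensor,
                          vfm_features: torch.Tensor,
                          proj: torch.nn.Module) -> torch.Tensor:
    """Hypothetical VIRAL-style regularizer (not the official code).

    mllm_visual_hidden: (B, N, d_llm) hidden states at visual-token positions.
    vfm_features:       (B, N, d_vfm) frozen features from a pre-trained VFM.
    proj:               trainable head mapping d_llm -> d_vfm.
    Returns 1 - mean cosine similarity between projected states and targets.
    """
    pred = F.normalize(proj(mllm_visual_hidden), dim=-1)
    target = F.normalize(vfm_features.detach(), dim=-1)  # VFM stays frozen
    return 1.0 - (pred * target).sum(dim=-1).mean()

# Toy usage with random stand-in tensors (dimensions are assumptions):
torch.manual_seed(0)
proj = torch.nn.Linear(4096, 1024)         # LLM width -> VFM feature width
h = torch.randn(2, 16, 4096)               # MLLM states at 16 visual tokens
f = torch.randn(2, 16, 1024)               # e.g. DINOv2-like patch features
loss = visual_alignment_loss(h, f, proj)   # scalar, differentiable w.r.t. proj
```

In training, such a term would typically be weighted and summed with the standard next-token language-modeling loss, so the visual pathway receives direct feature-level supervision alongside the text-only objective.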

Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Reasoning | BLINK | Accuracy | 49.3 | 76 |
| Multimodal Visual Perception | MMVP | Accuracy | 36 | 72 |
| Real-world Question Answering | RealworldQA | -- | -- | 58 |
| Vision-centric Evaluation | CV-Bench 2D | Score | 60.5 | 15 |
| 2D Computer Vision Benchmarking | CVBench2D | Accuracy | 62 | 13 |
| Knowledge-based Vision-Language Understanding | Knowledge | Average Score | 47.2 | 8 |
| General Vision-Language Understanding | General | Average Score | 70.9 | 8 |
| Vision-Centric Multi-modal Evaluation | Vision-Centric | Average Score | 53.5 | 8 |
| 3D Computer Vision Benchmarking | CVBench3D | Accuracy | 62.3 | 8 |
| Optical Character Recognition | OCR | Average Score | 36.1 | 8 |
Showing 10 of 12 rows
