VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization

About

Current multimodal large language models (MLLMs) face a critical challenge in modality alignment, often exhibiting a bias towards textual information at the expense of other modalities like vision. This paper conducts a systematic information-theoretic analysis of the widely used cross-entropy loss in MLLMs, uncovering its implicit alignment objective. Our theoretical investigation reveals that this implicit objective has inherent limitations, leading to a degradation of cross-modal alignment as text sequence length increases, thereby hindering effective multimodal information fusion. To overcome these drawbacks, we propose Vision-Text Alignment (VISTA), a novel approach guided by our theoretical insights. VISTA introduces an explicit alignment objective designed to maximize cross-modal mutual information, preventing the degradation of visual alignment. Notably, VISTA enhances the visual understanding capabilities of existing MLLMs without requiring any additional trainable modules or extra training data, making it both efficient and practical. Our method significantly outperforms baseline models across more than a dozen benchmark datasets, including VQAv2, MMStar, and MME, paving the way for new directions in MLLM modal alignment research.

Mingxiao Li, Na Su, Fang Qu, Zhizhou Zhong, Ziyang Chen, Yuan Li, Zhaopeng Tu, Xiaolong Li • 2025
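
The abstract refers to an explicit objective that maximizes cross-modal mutual information, but does not spell out its form. As a rough illustration only, the sketch below uses the generic InfoNCE estimator, a standard contrastive lower bound on mutual information, and is not VISTA's actual loss. The function names, the pooled-embedding inputs, and the temperature value are all assumptions made for this example.

```python
import numpy as np


def info_nce_loss(vision_emb, text_emb, temperature=0.07):
    """Generic InfoNCE loss over N paired (vision, text) embeddings.

    Illustrative stand-in, NOT the objective defined in the VISTA paper.
    Row i of vision_emb is assumed to be paired with row i of text_emb.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = vision_emb / np.linalg.norm(vision_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the matched pairs.
    logits = (v @ t.T) / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative mean log-probability assigned to the matched pair.
    return -np.mean(np.diag(log_probs))


def mi_lower_bound(vision_emb, text_emb, temperature=0.07):
    """InfoNCE bound: I(V; T) >= log(N) - L_InfoNCE."""
    n = vision_emb.shape[0]
    return np.log(n) - info_nce_loss(vision_emb, text_emb, temperature)


# Usage with synthetic data (hypothetical shapes: 8 pairs, 16 dimensions).
rng = np.random.default_rng(0)
vision = rng.normal(size=(8, 16))
aligned_text = vision + 0.01 * rng.normal(size=(8, 16))  # well-aligned pairs
shuffled_text = np.roll(aligned_text, shift=1, axis=0)   # pairing broken

print("aligned bound: ", mi_lower_bound(vision, aligned_text))
print("shuffled bound:", mi_lower_bound(vision, shuffled_text))
```

Minimizing this loss raises the bound, so well-aligned pairs score higher than mismatched ones. The bound saturates at log(N), which is one reason practical systems use large batches.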

Related benchmarks

Task	Dataset	Result
Visual Question Answering	TextVQA	Accuracy 46.63	1455
Multimodal Understanding	MMBench	--	887
Diagram Question Answering	AI2D	AI2D Accuracy 56.28	509
Optical Character Recognition	OCRBench	Score 321	486
Document Visual Question Answering	DocVQA	ANLS 22.66	301
Multimodal Understanding	MMMU	MMMU Score 35.88	232
Multimodal Reasoning	RealworldQA	Mean@8 Accuracy 56.34	40
Multimodal Understanding	MMStar	Score 35.91	26
Visual Question Answering	OKVQA	Score 56.3	10
