
VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization

About

Current multimodal large language models (MLLMs) face a critical challenge in modality alignment, often exhibiting a bias towards textual information at the expense of other modalities like vision. This paper conducts a systematic information-theoretic analysis of the widely used cross-entropy loss in MLLMs, uncovering its implicit alignment objective. Our theoretical investigation reveals that this implicit objective has inherent limitations, leading to a degradation of cross-modal alignment as text sequence length increases, thereby hindering effective multimodal information fusion. To overcome these drawbacks, we propose Vision-Text Alignment (VISTA), a novel approach guided by our theoretical insights. VISTA introduces an explicit alignment objective designed to maximize cross-modal mutual information, preventing the degradation of visual alignment. Notably, VISTA enhances the visual understanding capabilities of existing MLLMs without requiring any additional trainable modules or extra training data, making it both efficient and practical. Our method significantly outperforms baseline models across more than a dozen benchmark datasets, including VQAv2, MMStar, and MME, paving the way for new directions in MLLM modal alignment research.
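The abstract does not spell out the explicit alignment objective, so the following is only a minimal sketch of the general idea of maximizing cross-modal mutual information. It uses InfoNCE, a standard contrastive lower bound on mutual information, as a stand-in and should not be read as VISTA's actual loss. The function names, the use of one pooled embedding per image and per caption, and the `temperature` value are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(vision, text, temperature=0.1):
    """Vision-to-text InfoNCE loss over a batch of paired embeddings.

    vision[i] and text[i] are assumed to describe the same sample
    (hypothetical pooled embeddings, not the paper's representation).
    """
    n = len(vision)
    v = [normalize(x) for x in vision]
    t = [normalize(x) for x in text]
    total = 0.0
    for i in range(n):
        # Cosine similarity of image i against every caption in the batch.
        logits = [dot(v[i], t[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching caption.
        total += log_denominator - logits[i]
    return total / n


def mutual_information_lower_bound(vision, text, temperature=0.1):
    """InfoNCE bound: I(V; T) >= log(N) - L_NCE for a batch of size N."""
    return math.log(len(vision)) - info_nce_loss(vision, text, temperature)
```

Minimizing `info_nce_loss` raises the bound returned by `mutual_information_lower_bound`, which is capped at log(N) for a batch of N pairs. Correctly paired embeddings should therefore score higher than mismatched ones.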

Mingxiao Li, Na Su, Fang Qu, Zhizhou Zhong, Ziyang Chen, Yuan Li, Zhaopeng Tu, Xiaolong Li • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy 46.63 | 1453 |
| Multimodal Understanding | MMBench | -- | 847 |
| Optical Character Recognition | OCRBench | Score 321 | 433 |
| Diagram Question Answering | AI2D | AI2D Accuracy 56.28 | 387 |
| Document Visual Question Answering | DocVQA | ANLS 22.66 | 301 |
| Multimodal Understanding | MMMU | MMMU Score 35.88 | 232 |
| Multimodal Reasoning | RealworldQA | Mean@8 Accuracy 56.34 | 40 |
| Multimodal Understanding | MMStar | Score 35.91 | 26 |
| Visual Question Answering | OKVQA | Score 56.3 | 10 |
