
Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

About

Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite this impressive performance, MLLMs suffer from a modality imbalance issue: visual information is often underutilized relative to textual representations in deeper layers, leading to degraded visual performance or hallucinations. This issue stems from the predominant reliance on next-text-token prediction during training, which provides no direct visual supervisory signal and results in progressive homogenization of visual representations across layers. To address this, we propose Latent Visual Reconstruction (LaVer), a novel training framework that helps MLLMs learn more discriminative visual representations via masked image modeling in the joint latent semantic space of the LLM. Our method provides direct visual activation to MLLMs, which then exhibit increased visual attention allocation, indicating enhanced utilization of visual information. Extensive experiments across diverse benchmarks demonstrate the superiority of our approach in various scenarios, especially those requiring dense visual capabilities. Code for LaVer is available at https://github.com/Fir-lat/LaVer.
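
The abstract does not spell out the training objective, so the following PyTorch sketch is only an illustration of what masked image modeling in the LLM's latent space could look like: a fraction of the projected visual tokens is masked, the LLM backbone processes the mixed sequence, and a small head reconstructs latent targets at the masked positions. The `LaVerSketch` class, the `mask_ratio` value, the `recon_head`, and the `llm` callable are all our assumptions, not the authors' released code; consult the repository linked above for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaVerSketch(nn.Module):
    """Hypothetical sketch of a LaVer-style auxiliary objective:
    mask visual tokens, run the MLLM backbone, and reconstruct
    latent-space targets at the masked positions. This loss would
    be added to the usual next-text-token-prediction loss."""

    def __init__(self, hidden_dim: int, mask_ratio: float = 0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learnable placeholder embedding substituted for masked visual tokens.
        self.mask_token = nn.Parameter(torch.zeros(hidden_dim))
        # Lightweight head predicting latent targets from LLM hidden states.
        self.recon_head = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, llm, visual_tokens, text_tokens, latent_targets):
        # visual_tokens:  (B, Nv, D) projected image features
        # text_tokens:    (B, Nt, D) text embeddings
        # latent_targets: (B, Nv, D) targets in the LLM's latent space,
        #   e.g. visual hidden states from a frozen, unmasked forward pass
        B, Nv, D = visual_tokens.shape

        # Randomly select visual positions to mask.
        mask = torch.rand(B, Nv, device=visual_tokens.device) < self.mask_ratio
        masked_visual = torch.where(
            mask.unsqueeze(-1),
            self.mask_token.expand(B, Nv, D),
            visual_tokens,
        )

        # Run the LLM backbone on the joint visual + text sequence.
        # `llm` is assumed to be a callable operating on embeddings and
        # returning final-layer hidden states of shape (B, Nv + Nt, D).
        hidden = llm(torch.cat([masked_visual, text_tokens], dim=1))

        # Reconstruct only the masked visual positions in latent space.
        pred = self.recon_head(hidden[:, :Nv])
        return F.mse_loss(pred[mask], latent_targets[mask])
```

Reconstructing in the LLM's latent semantic space, rather than in pixel space, avoids the need for a pixel decoder and supervises exactly the representations that the abstract says become homogenized in deeper layers; the specific masking strategy, target construction, and loss weighting here are placeholders.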

Hengzhuang Li, Xinsong Zhang, Qiming Peng, Bin Luo, Han Hu, Dengyang Jiang, Han-Jia Ye, Teng Zhang, Hai Jin • 2025

Related benchmarks

Task                          | Dataset          | Result            | Rank
Multimodal Evaluation         | MME              | Score: 1590       | 557
Visual Question Answering     | ChartQA          | --                | 239
Science Question Answering    | ScienceQA        | --                | 229
Multimodal Understanding      | MMStar           | --                | 197
Diagram Understanding         | AI2D             | Accuracy: 92.71   | 167
Hallucination Evaluation      | POPE             | --                | 132
Reasoning Segmentation        | ReasonSeg (test) | --                | 102
Visual Question Answering     | RealworldQA      | Accuracy: 59.35   | 98
Optical Character Recognition | OCRBench         | --                | 83
Multimodal Understanding      | MMMU             | MMMU Score: 47.56 | 78

Showing 10 of 18 benchmarks.
