
Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models

About

Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in multimodal tasks. Despite this impressive performance, MLLMs suffer from a modality imbalance issue: visual information is often underutilized relative to textual representations in deeper layers, leading to degraded visual performance or hallucinations. This issue stems from the predominant reliance on next-text-token prediction during training, which provides no direct visual supervisory signal and results in progressive homogenization of visual representations across layers. To address this, we propose Latent Visual Reconstruction (LaVer), a novel training framework that helps MLLMs learn more discriminative visual representations via masked image modeling in the joint latent semantic space of the LLM. Our method provides direct visual activation to MLLMs, which then exhibit increased visual attention allocation, indicating enhanced utilization of visual information. Extensive experiments across diverse benchmarks demonstrate the superiority of our approach in various scenarios, especially those requiring dense visual capabilities. Code for LaVer is available at https://github.com/Fir-lat/LaVer.
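
The abstract does not spell out the training objective, so the following PyTorch sketch is only an illustration of what masked image modeling in the LLM's latent space could look like: a fraction of the projected visual tokens is masked, the LLM backbone processes the mixed sequence, and a small head reconstructs latent targets at the masked positions. The `LaVerSketch` class, the `mask_ratio` value, the `recon_head`, and the `llm` callable are all our assumptions, not the authors' released code; consult the repository linked above for the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaVerSketch(nn.Module):
    """Hypothetical sketch of a LaVer-style auxiliary objective:
    mask visual tokens, run the MLLM backbone, and reconstruct
    latent-space targets at the masked positions. This loss would
    be added to the usual next-text-token-prediction loss."""

    def __init__(self, hidden_dim: int, mask_ratio: float = 0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learnable placeholder embedding substituted for masked visual tokens.
        self.mask_token = nn.Parameter(torch.zeros(hidden_dim))
        # Lightweight head predicting latent targets from LLM hidden states.
        self.recon_head = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, llm, visual_tokens, text_tokens, latent_targets):
        # visual_tokens:  (B, Nv, D) projected image features
        # text_tokens:    (B, Nt, D) text embeddings
        # latent_targets: (B, Nv, D) targets in the LLM's latent space,
        #   e.g. visual hidden states from a frozen, unmasked forward pass
        B, Nv, D = visual_tokens.shape

        # Randomly select visual positions to mask.
        mask = torch.rand(B, Nv, device=visual_tokens.device) < self.mask_ratio
        masked_visual = torch.where(
            mask.unsqueeze(-1),
            self.mask_token.expand(B, Nv, D),
            visual_tokens,
        )

        # Run the LLM backbone on the joint visual + text sequence.
        # `llm` is assumed to be a callable operating on embeddings and
        # returning final-layer hidden states of shape (B, Nv + Nt, D).
        hidden = llm(torch.cat([masked_visual, text_tokens], dim=1))

        # Reconstruct only the masked visual positions in latent space.
        pred = self.recon_head(hidden[:, :Nv])
        return F.mse_loss(pred[mask], latent_targets[mask])
```

Reconstructing in the LLM's latent semantic space, rather than in pixel space, avoids the need for a pixel decoder and supervises exactly the representations that the abstract says become homogenized in deeper layers; the specific masking strategy, target construction, and loss weighting here are placeholders.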

Hengzhuang Li, Xinsong Zhang, Qiming Peng, Bin Luo, Han Hu, Dengyang Jiang, Han-Jia Ye, Teng Zhang, Hai Jin • 2025

Related benchmarks

Task                          | Dataset          | Result            | Rank
Multimodal Evaluation         | MME              | Score: 1590       | 557
Visual Question Answering     | ChartQA          | --                | 239
Science Question Answering    | ScienceQA        | --                | 229
Multimodal Understanding      | MMStar           | --                | 197
Diagram Understanding         | AI2D             | Accuracy: 92.71   | 167
Hallucination Evaluation      | POPE             | --                | 132
Reasoning Segmentation        | ReasonSeg (test) | --                | 102
Visual Question Answering     | RealworldQA      | Accuracy: 59.35   | 98
Optical Character Recognition | OCRBench         | --                | 83
Multimodal Understanding      | MMMU             | MMMU Score: 47.56 | 78

Showing 10 of 18 benchmarks.
