MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning
About
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, which leads models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in multimodal large language models (MLLMs). We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, providing learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency from attention over visual tokens and masking the most vision-dependent segments; the model then reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
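To make the pipeline concrete, here is a minimal PyTorch sketch of the two steps the abstract describes: scoring caption sentences by their attention mass on visual tokens, masking the most vision-dependent ones, and blending text-side and image-side similarity into a reward. Everything here is an illustrative assumption, not the released MMRPT code: the function names, averaging attention over heads and layers, the top-k masking ratio, and the cosine-similarity reward blend are all our guesses at one plausible instantiation.

```python
# Hedged sketch of MMRPT-style masking and reward, inferred from the abstract.
# All names, shapes, and weighting choices below are assumptions for illustration.
import torch

def sentence_visual_dependency(attn, sent_ids, visual_mask):
    """Score each caption sentence by how much attention its text tokens
    place on visual tokens (assumed proxy for sentence-level visual dependency).

    attn:        (num_text_tokens, num_all_tokens) attention weights,
                 assumed already averaged over layers and heads.
    sent_ids:    (num_text_tokens,) sentence index of each text token.
    visual_mask: (num_all_tokens,) True where the key position is a visual token.
    """
    vis_attn = attn[:, visual_mask].sum(dim=-1)   # per-token attention mass on vision
    n_sents = int(sent_ids.max().item()) + 1
    scores = torch.zeros(n_sents)
    for s in range(n_sents):
        tok = sent_ids == s
        scores[s] = vis_attn[tok].mean()          # average over the sentence's tokens
    return scores

def mask_vision_dependent_sentences(sentences, scores, ratio=0.3):
    """Replace the top-`ratio` most vision-dependent sentences with a mask token
    (the ratio and the single <MASK> placeholder are assumptions)."""
    k = max(1, int(ratio * len(sentences)))
    top = set(torch.topk(scores, k).indices.tolist())
    return [("<MASK>" if i in top else s) for i, s in enumerate(sentences)]

def semantic_visual_reward(pred_emb, ref_emb, img_emb, alpha=0.5):
    """One guess at the semantic-visual reward: a convex blend of the
    reconstruction's similarity to the reference text and to the image."""
    sim_text = torch.cosine_similarity(pred_emb, ref_emb, dim=-1)
    sim_img = torch.cosine_similarity(pred_emb, img_emb, dim=-1)
    return alpha * sim_text + (1 - alpha) * sim_img

# Toy run with random attention standing in for a real MLLM forward pass.
if __name__ == "__main__":
    sentences = ["A dog runs.", "It is a sunny day.", "The dog wears a red collar."]
    sent_ids = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])  # 10 text tokens
    attn = torch.rand(10, 74).softmax(dim=-1)                 # keys: 64 visual + 10 text
    visual_mask = torch.zeros(74, dtype=torch.bool)
    visual_mask[:64] = True

    scores = sentence_visual_dependency(attn, sent_ids, visual_mask)
    print(mask_vision_dependent_sentences(sentences, scores))

    pred, ref, img = torch.randn(512), torch.randn(512), torch.randn(512)
    print(semantic_visual_reward(pred, ref, img))
```

In this reading, the reward (rather than token-level cross-entropy on the caption) is what the reinforcement objective optimizes, which is how the framework avoids rewarding caption imitation directly.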
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathVista | Score | 69.04 | 322 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 68.12 | 281 |
| Chart Question Answering | ChartQA | Accuracy | 87.33 | 229 |
| Multimodal Reasoning | MMStar | -- | -- | 81 |
| Mathematical Reasoning | WeMath | Accuracy | 36.88 | 75 |
| Visual Perception | BLINK | -- | -- | 71 |
| Multimodal Reasoning | MMBench | -- | -- | 50 |