# MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

## About
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
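The masking step described above — scoring each caption sentence by how much attention it places on visual tokens, then masking the most vision-dependent spans — can be sketched as follows. This is a minimal illustration under assumed inputs, not the paper's implementation: the attention matrix shape, the `visual_idx` layout, the mean-attention-mass scoring, and the `mask_ratio` parameter are all simplifying assumptions.

```python
import numpy as np

def visual_dependency_scores(attn, visual_idx, sentence_spans):
    """Score each sentence by its mean attention mass on visual tokens.

    attn:           (num_text_tokens, num_all_tokens) attention weights
                    (assumed already averaged over heads/layers)
    visual_idx:     key-dimension indices of the visual tokens
    sentence_spans: list of (start, end) text-token ranges, one per sentence
    """
    scores = []
    for start, end in sentence_spans:
        # total attention each token sends to visual tokens, averaged per sentence
        scores.append(attn[start:end, visual_idx].sum(axis=1).mean())
    return np.array(scores)

def mask_vision_dependent(sentences, scores, mask_ratio=0.5, mask_token="<mask>"):
    """Replace the top-scoring (most vision-dependent) sentences with a mask token."""
    k = max(1, int(len(sentences) * mask_ratio))
    masked = set(np.argsort(scores)[-k:])  # indices of the k highest scores
    return [mask_token if i in masked else s for i, s in enumerate(sentences)]

# Toy example: sentence 2 (tokens 3-5) attends heavily to visual tokens 0-3.
attn = np.zeros((6, 10))
attn[3:6, :4] = 0.25
spans = [(0, 3), (3, 6)]
scores = visual_dependency_scores(attn, np.arange(4), spans)
caption = ["A photo.", "A cat sits on a red mat."]
print(mask_vision_dependent(caption, scores, mask_ratio=0.5))
```

In pre-training, the model would then reconstruct each masked span via vision-grounded reasoning, with the semantic-visual reward scoring the reconstruction against the image rather than against the surface caption.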
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 68.12 | 431 |
| Mathematical Reasoning | MathVista | Score | 69.04 | 385 |
| Chart Question Answering | ChartQA | Accuracy | 87.33 | 356 |
| Mathematical Reasoning | WeMath | Accuracy | 36.88 | 161 |
| Multimodal Reasoning | MMStar | -- | -- | 143 |
| Visual Perception | BLINK | -- | -- | 122 |
| Multimodal Reasoning | MMBench | Overall Score | 83.36 | 78 |