Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning

About

Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. Multi-modal Large Language Models (MLLMs), on the other hand, still lag behind, hindered by their outdated internal LLMs. Upgrading these LLMs is often prohibitively expensive, as it requires costly vision-language alignment retraining. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. This approach redefines the MLLM's role: it converts multi-modal inputs into detailed textual outputs that can be processed by any powerful, external, text-only LLM reasoner. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM according to the correctness of the answers that the external reasoner generates from its captions, which encourages captions that are faithful and query-relevant. Together, this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon Decoupling (RAPID) approach. Empirical results show that RAPID achieves significant performance gains on multi-modal reasoning benchmarks. Crucially, RAPID enables a novel inference-time scaling paradigm: once trained with VPO, the MLLM can be paired with any state-of-the-art LLM reasoner for consistent performance improvement without retraining.
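The decoupled pipeline and the correctness-based reward described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the function names, the choice to pass the question to the captioner, the exact-match comparison, and the binary 0/1 reward. The abstract states only that the reward depends on the correctness of the external reasoner's answers. The captioner and reasoner are toy stand-ins for an MLLM and a text-only LLM.

```python
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
# a captioner maps (image bytes, question) to a textual description,
# a reasoner maps (caption, question) to a final answer string.
Captioner = Callable[[bytes, str], str]
Reasoner = Callable[[str, str], str]


def decoupled_answer(image: bytes, question: str,
                     captioner: Captioner, reasoner: Reasoner) -> str:
    """Perception step produces text; a text-only reasoner answers from it."""
    caption = captioner(image, question)
    return reasoner(caption, question)


def correctness_reward(image: bytes, question: str, reference: str,
                       captioner: Captioner, reasoner: Reasoner) -> float:
    """Assumed reward: 1.0 if the reasoner's answer matches the reference.

    Exact string match is a simplification chosen for this sketch.
    """
    answer = decoupled_answer(image, question, captioner, reasoner)
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


# Toy stand-ins so the sketch runs without any model.
def toy_captioner(image: bytes, question: str) -> str:
    return "A right triangle with legs of length 3 and 4."


def vague_captioner(image: bytes, question: str) -> str:
    return "A geometric figure."


def toy_reasoner(caption: str, question: str) -> str:
    # The reasoner never sees the image, only the caption text.
    return "5" if "3 and 4" in caption else "unknown"


question = "What is the length of the hypotenuse?"
good = correctness_reward(b"", question, "5", toy_captioner, toy_reasoner)
bad = correctness_reward(b"", question, "5", vague_captioner, toy_reasoner)
```

In this toy setup the informative caption earns a reward and the vague one does not, which is the signal that would push the captioner toward query-relevant detail. Because the reasoner is just a callable operating on text, swapping in a different one requires no change to the captioner, which mirrors the replaceability the abstract emphasizes.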

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Mathematical Multimodal Reasoning	MathVista	Accuracy	76.8	258
Multimodal Reasoning	MMMU	Accuracy	72.4	208
Multimodal Reasoning	WeMath	Accuracy	52.1	171
Multimodal Reasoning	MathVision	Accuracy	53.4	162
Multimodal Reasoning	LogicVista	Accuracy	60.4	147
Multimodal Reasoning	MathVerse	Accuracy	56.2	130
Multimodal Reasoning	DynaMath	Accuracy	38.3	72

