
Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning

About

Recent breakthroughs in reasoning language models have significantly advanced text-based reasoning. Multi-modal Large Language Models (MLLMs), by contrast, still lag behind, hindered by their outdated internal LLMs. Upgrading these LLMs is often prohibitively expensive because it requires retraining the vision-language alignment. To address this issue, we introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. This approach redefines the MLLM's role: it converts multi-modal inputs into detailed textual outputs that any powerful, external, text-only LLM reasoner can process. To align the MLLM's perceptual output with the final reasoning task, we propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO). VPO rewards the MLLM according to the correctness of the answers that the external reasoner generates from its captions, which encourages faithful and query-relevant captions. Together, this decoupling pipeline and VPO form our Reasoning-Aligned PerceptIon Decoupling (RAPID) approach. Empirical results show that RAPID achieves significant performance gains on multi-modal reasoning benchmarks. Crucially, RAPID enables a novel inference-time scaling paradigm: once trained with VPO, the MLLM can be paired with any state-of-the-art LLM reasoner for consistent performance improvement without retraining.
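
The decoupled pipeline and the reward signal described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`decoupled_answer`, `vpo_style_reward`, `score_candidates`, and the toy captioner and reasoner) is hypothetical, and the binary exact-match reward is an assumption, since the abstract states only that the reward depends on the correctness of the external reasoner's answer. The reinforcement learning update itself is not shown.

```python
from typing import Callable, List

# Hypothetical interfaces, assumed for illustration only.
# A captioner stands in for the MLLM: (image reference, question) -> caption.
# A reasoner stands in for a text-only LLM: (caption, question) -> answer.
Captioner = Callable[[str, str], str]
Reasoner = Callable[[str, str], str]


def decoupled_answer(captioner: Captioner, reasoner: Reasoner,
                     image: str, question: str) -> str:
    """Perception step produces text; a swappable reasoner answers from it."""
    caption = captioner(image, question)
    return reasoner(caption, question)


def vpo_style_reward(caption: str, question: str, reference: str,
                     reasoner: Reasoner) -> float:
    """Assumed binary reward: 1.0 if the reasoner's answer matches the reference."""
    answer = reasoner(caption, question)
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def score_candidates(captions: List[str], question: str, reference: str,
                     reasoner: Reasoner) -> List[float]:
    """Score several sampled captions for one query, as an RL step might need."""
    return [vpo_style_reward(c, question, reference, reasoner) for c in captions]


# Toy stand-ins so the sketch runs without any model.
def toy_captioner(image: str, question: str) -> str:
    return "A bowl holding 3 red apples on a table."


def toy_reasoner(caption: str, question: str) -> str:
    # Returns the first whitespace-separated integer token in the caption.
    for token in caption.split():
        if token.isdigit():
            return token
    return "unknown"


question = "How many apples are in the bowl?"
candidates = [
    "A bowl holding 3 red apples on a table.",  # query-relevant detail present
    "A bowl of fruit on a table.",              # faithful but omits the count
]
rewards = score_candidates(candidates, question, "3", toy_reasoner)
final_answer = decoupled_answer(toy_captioner, toy_reasoner, "img.png", question)
print(rewards, final_answer)
```

In this toy setup the caption that carries the query-relevant detail earns the reward and the vaguer caption does not. Because the reasoner is passed in as an argument, replacing `toy_reasoner` with a stronger one requires no change to the captioner, which mirrors the pairing without retraining that the abstract describes.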

Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Xin Jin, Zhenguo Li, James T. Kwok, Yu Zhang • 2025

Related benchmarks

Task                                Dataset      Result          Rank
Mathematical Multimodal Reasoning   MathVista    Accuracy 76.8   218
Multimodal Reasoning                MMMU         Accuracy 72.4   130
Multimodal Reasoning                WeMath       Accuracy 52.1   129
Multimodal Reasoning                MathVision   Accuracy 53.4   102
Multimodal Reasoning                LogicVista   Accuracy 60.4   99
Multimodal Reasoning                MathVerse    Accuracy 56.2   84
Multimodal Reasoning                DynaMath     Accuracy 38.3   58
