Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

About

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent work has pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic workflows. Worse, this heavy investment does not yield proportional gains and often produces a "seesaw effect", in which gains in perception come at the expense of reasoning or vice versa. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding perceptual fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), which leverages a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error (either bad seeing or bad thinking), enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.
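The abstract gives no implementation details, so the following is only a minimal sketch of how modality-aware reward routing might look, under assumptions that are ours and not the authors'. It assumes a trajectory is a list of tagged perception and reasoning steps, that the "blindfolded" check means a text-only reasoner tries to answer from the perception text alone, and that rewards are simple per-step scalars. The names `Step`, `perception_verified`, `route_rewards`, and `blind_reasoner` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str  # "perception" or "reasoning" (hypothetical tagging scheme)
    text: str


def perception_verified(
    steps: List[Step],
    question: str,
    reference: str,
    blind_reasoner: Callable[[str, str], str],
) -> bool:
    """Assumed 'blindfolded' check: a text-only reasoner sees the question and
    the perception text (never the image) and must reach the reference answer."""
    described = " ".join(s.text for s in steps if s.kind == "perception")
    return blind_reasoner(question, described) == reference


def route_rewards(
    steps: List[Step],
    final_answer: str,
    question: str,
    reference: str,
    blind_reasoner: Callable[[str, str], str],
) -> List[float]:
    """Return one scalar reward per step, routed by the assumed error source."""
    seen_ok = perception_verified(steps, question, reference, blind_reasoner)
    answer_ok = final_answer == reference
    rewards: List[float] = []
    for step in steps:
        if step.kind == "perception":
            # Perception is scored by the blindfolded check alone,
            # independent of whether the model's own answer was right.
            rewards.append(1.0 if seen_ok else -1.0)
        elif answer_ok:
            rewards.append(1.0)
        elif seen_ok:
            # Perception was sufficient but the answer is wrong: "bad thinking".
            rewards.append(-1.0)
        else:
            # Perception was insufficient: "bad seeing", so reasoning is not blamed.
            rewards.append(0.0)
    return rewards
```

With a toy `blind_reasoner`, a trajectory whose perception step is sufficient but whose final answer is wrong penalizes only the reasoning step, while a trajectory with an inadequate perception step penalizes only the perception step. The exact-match comparison and the reward values of 1.0, 0.0, and -1.0 are placeholders chosen for illustration.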

Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MathVista | Score 80.3 | 474 |
| Visual Mathematical Reasoning | MathVista | Accuracy 73.8 | 366 |
| Massive Multi-discipline Multimodal Understanding | MMMU | -- | 216 |
| Information Visual Question Answering | InfoVQA | Accuracy 87.8 | 110 |
| Multi-modal Reasoning | EMMA | Accuracy 31.3 | 57 |
| Document Visual Question Answering | SlideVQA | Accuracy 0.583 | 53 |
| Visual Perception | V* | Score 89 | 42 |
| Document Understanding | DUDE | Accuracy 47.6 | 32 |
| Slide Question Answering | SlideVQA | Overall Score 59.6 | 29 |
| High-resolution Visual Understanding | HRBench | Score 76 | 15 |
Showing 10 of 17 rows
