More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models
About
Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.
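The paper does not publish its exact reward formula, but the described setup — a verifiable answer reward augmented by a judge-VLM score for the model's self-generated image description — can be sketched as follows. The function names, the `[0, 1]` description score, and the mixing weight `lambda_desc` are illustrative assumptions, not details from PeRL-VL:

```python
# Hedged sketch of a combined RLVR reward: a verifiable exact-match answer
# reward plus a weighted description-quality score. All names and the mixing
# scheme are illustrative assumptions, not the paper's actual implementation.

def answer_reward(predicted: str, gold: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches the gold answer."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def combined_reward(predicted: str, gold: str,
                    description_score: float,
                    lambda_desc: float = 0.5) -> float:
    """Mix the answer reward with a judge-assigned score in [0, 1] rating
    the image description's faithfulness and sufficiency."""
    return answer_reward(predicted, gold) + lambda_desc * description_score

# Correct answer whose image description the judge VLM scored 0.8:
r = combined_reward("42", "42", description_score=0.8)
```

In practice `description_score` would come from a separate VLM judge prompted to check the description against the image; the sketch only shows how such a signal could be folded into the scalar reward that RLVR optimizes.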
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Hallucination | HallusionBench | Pass@1 | 55.9 | 16 |
| OCR-centric visual reasoning | OCRBench | Pass@1 | 86.56 | 13 |
| General-purpose multiple-choice evaluation | MMBench EN v1.1 (dev) | Pass@1 | 84.29 | 13 |
| Expert-level multidisciplinary QA | MMMU (dev val) | Pass@1 | 52.22 | 13 |
| Visual mathematical reasoning | MathVista mini | Pass@1 | 67.05 | 13 |
| Mathematical reasoning | DynaMath | Pass@1 | 51.01 | 9 |
| Multimodal understanding | MM-Vet | Pass@1 | 72.22 | 9 |
| Multimodal understanding | MMStar | Pass@1 Score | 59.95 | 9 |