
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

About

Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
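The self-reflection procedure described above (decompose the chain into perception vs. reasoning steps, judge the perception steps against the image, and locally regenerate the unfaithful ones) can be sketched as follows. This is a minimal illustration, not the paper's code: `is_perception_step`, `judge_faithfulness`, and `regenerate_step` are hypothetical stand-ins for the VLM calls the paper uses.

```python
# Hedged sketch of the detect-and-regenerate loop. In the paper, each of
# these helpers is a call to a VLM (decomposer, off-the-shelf judge, and
# the reasoning model itself); here they are trivial stand-ins so the
# control flow is runnable.

def is_perception_step(step: str) -> bool:
    # Stand-in for the perception-vs-reasoning decomposition.
    return step.lower().startswith("i see")

def judge_faithfulness(step: str, image) -> bool:
    # Stand-in for a step-level VLM judge grounded in the image.
    return "purple sky" not in step

def regenerate_step(step: str, image) -> str:
    # Stand-in for locally regenerating one unfaithful perception step.
    return step.replace("purple sky", "blue sky")

def self_reflect(chain: list[str], image=None) -> list[str]:
    """Detect unfaithful perception steps and regenerate them in place,
    leaving reasoning steps and faithful perception steps untouched."""
    revised = []
    for step in chain:
        if is_perception_step(step) and not judge_faithfulness(step, image):
            step = regenerate_step(step, image)
        revised.append(step)
    return revised

chain = ["I see a purple sky above the lake.", "Therefore it is daytime."]
print(self_reflect(chain))
```

Because only flagged perception steps are rewritten, the surrounding reasoning steps and the final answer are left intact, which is how the method reduces the Unfaithful Perception Rate without retraining or hurting accuracy.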

Rheeya Uppaal, Phu Mon Htut, Min Bai, Nikolaos Pappas, Zheng Qi, Sandesh Swamy • 2025

Related benchmarks

Task                    | Dataset              | Result         | Rank
Multimodal Reasoning    | HallusionBench       | Accuracy 0.643 | 42
Multimodal Reasoning    | MMEvalPro            | Accuracy 82.8  | 15
Multimodal Reasoning    | MMVP (test)          | UPR 0.031      | 6
Hallucination Detection | MMEvalPro perception | --             | 5
Multimodal Reasoning    | MMVP                 | --             | 5
