
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

About

Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
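The self-reflection procedure described above (decompose the chain into perception vs. reasoning steps, judge the perception steps against the image, and locally regenerate the unfaithful ones) can be sketched as follows. This is a minimal illustration, not the paper's code: `is_perception_step`, `judge_faithfulness`, and `regenerate_step` are hypothetical stand-ins for the VLM calls the paper uses.

```python
# Hedged sketch of the detect-and-regenerate loop. In the paper, each of
# these helpers is a call to a VLM (decomposer, off-the-shelf judge, and
# the reasoning model itself); here they are trivial stand-ins so the
# control flow is runnable.

def is_perception_step(step: str) -> bool:
    # Stand-in for the perception-vs-reasoning decomposition.
    return step.lower().startswith("i see")

def judge_faithfulness(step: str, image) -> bool:
    # Stand-in for a step-level VLM judge grounded in the image.
    return "purple sky" not in step

def regenerate_step(step: str, image) -> str:
    # Stand-in for locally regenerating one unfaithful perception step.
    return step.replace("purple sky", "blue sky")

def self_reflect(chain: list[str], image=None) -> list[str]:
    """Detect unfaithful perception steps and regenerate them in place,
    leaving reasoning steps and faithful perception steps untouched."""
    revised = []
    for step in chain:
        if is_perception_step(step) and not judge_faithfulness(step, image):
            step = regenerate_step(step, image)
        revised.append(step)
    return revised

chain = ["I see a purple sky above the lake.", "Therefore it is daytime."]
print(self_reflect(chain))
```

Because only flagged perception steps are rewritten, the surrounding reasoning steps and the final answer are left intact, which is how the method reduces the Unfaithful Perception Rate without retraining or hurting accuracy.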

Rheeya Uppaal, Phu Mon Htut, Min Bai, Nikolaos Pappas, Zheng Qi, Sandesh Swamy • 2025

Related benchmarks

Task                    | Dataset              | Result         | Rank
Multimodal Reasoning    | HallusionBench       | Accuracy 0.643 | 42
Multimodal Reasoning    | MMEvalPro            | Accuracy 82.8  | 15
Multimodal Reasoning    | MMVP (test)          | UPR 0.031      | 6
Hallucination Detection | MMEvalPro perception | --             | 5
Multimodal Reasoning    | MMVP                 | --             | 5
