
SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

About

Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has focused mainly on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image-difference reasoning that enforces explicit visual comparison. Evaluations of state-of-the-art MLLMs reveal two systematic failure modes: perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a training-free, visual-evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and code are available at https://github.com/Johanson-colab/SPD-Faith-Bench.

Weijiang Lv, Yaoxuan Feng, Xiaobo Xia, Jiayu Wang, Yan Jing, Wenchao Chen, Bo Chen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | 935 |
| Object Hallucination Evaluation | CHAIR | CS Score 42.3 | 49 |
| Multimodal Understanding | MME | Existence Score 200 | 12 |
| Faithful Perception | SPD-Faith Bench Multi-Difference Subset 1.0 (test) | -- | 12 |
| Faithful Reasoning | SPD-Faith Bench Multi-Difference 1.0 (test) | -- | 12 |
| Global Perception | SPD-Faith Bench Multi-Difference 1.0 (test) | -- | 12 |
| Multimodal Reasoning | SPD-Faith Bench Easy 1.0 | -- | 12 |
| Multimodal Reasoning | SPD-Faith Bench Medium 1.0 | -- | 12 |
| Multimodal Reasoning | SPD-Faith Bench Hard 1.0 | -- | 12 |
| Faithfulness Evaluation | SPD-Faith Bench (test) | DS 46.2 | 7 |
