
Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models

About

Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capabilities on complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during generation. While visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs), the reasoning process of MCoT models differs fundamentally from that of traditional LVLMs, which raises a basic question: do MCoT models have unique causes of hallucination? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated text is primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing approaches by a large margin. More importantly, it can be conveniently integrated with other hallucination mitigation methods to further boost their performance. The code is publicly available at https://github.com/ASGO-MM/MCoT-hallucination.
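The abstract describes two components: localizing low-visual-attention "divergent thinking" steps, and a decoding-time intervention. The paper's actual algorithm is not given here, so the sketch below is purely illustrative: it flags reasoning steps whose mean attention to image tokens falls below a threshold, and applies a generic contrastive-style logit correction (in the spirit of visual contrastive decoding). All function names, the threshold, and `alpha` are assumptions, not the authors' implementation.

```python
# Hedged sketch, NOT the paper's implementation. Assumes we can read out,
# per reasoning step, the mean attention mass placed on image tokens.

def localize_divergent_steps(visual_attention, threshold=0.2):
    """Return indices of reasoning steps whose mean attention to image
    tokens drops below `threshold` -- a proxy for 'divergent thinking'.

    visual_attention: list of floats in [0, 1], one per reasoning step.
    """
    return [i for i, a in enumerate(visual_attention) if a < threshold]


def contrastive_logits(logits_with_image, logits_without_image, alpha=1.0):
    """Generic contrastive correction applied only at flagged steps:
    amplify evidence the image supports and penalize text-only priors.

    Returns (1 + alpha) * l_img - alpha * l_no_img, elementwise.
    """
    return [(1 + alpha) * lw - alpha * lo
            for lw, lo in zip(logits_with_image, logits_without_image)]
```

For example, `localize_divergent_steps([0.5, 0.1, 0.3, 0.05])` flags steps 1 and 3; a decoder would then replace the raw logits at those steps with `contrastive_logits(...)` before sampling.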

Ji Ma, Wei Suo, Peng Wang, Yanning Zhang• 2026

Related benchmarks

Task                           Dataset             Metric               Result   Rank
Object Hallucination           POPE (Adversarial)  Accuracy             84.3     288
Object Hallucination           POPE (Random)       F1 Score             85.6     285
Visual Mathematical Reasoning  MathVista           Accuracy             65       278
Object Hallucination           POPE (Popular)      F1 Score             85.3     273
Hallucination Evaluation       MMHal-Bench         MMHal Score          4.12     216
Multimodal Reasoning           MMStar              Accuracy             59.4     143
Hallucination Evaluation       Object-HalBench     CHAIR_s (lower is better)  16  46
Spatial Reasoning              VSR                 LLM-Judge Accuracy   84.3     12
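The Object-HalBench row reports CHAIR_s, the sentence-level hallucination rate: the percentage of generated captions that mention at least one object absent from the image. A minimal sketch of that metric, assuming object mentions have already been extracted and mapped to the benchmark's vocabulary:

```python
# Illustrative CHAIR_s computation. Assumes each sample is a pair of
# (objects mentioned in the caption, ground-truth objects in the image),
# both as sets of canonical object names.

def chair_s(samples):
    """Percentage of captions containing >= 1 hallucinated object."""
    hallucinated = sum(
        1 for mentioned, ground_truth in samples
        if any(obj not in ground_truth for obj in mentioned)
    )
    return 100.0 * hallucinated / len(samples)
```

For instance, with one clean caption and one caption mentioning an object not in the image, `chair_s` returns 50.0; the table's score of 16 means 16% of captions contained a hallucinated object.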
