Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

About

Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
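To make the contrast described in the abstract concrete, here is a minimal sketch of the two preference terms. It is not the paper's implementation: the exact RC-DPO objective is not given on this page, so the reasoning-conditioned term below assumes a standard DPO-style log-sigmoid margin. All function and parameter names are hypothetical, and the inputs are assumed to be summed token log-probabilities computed elsewhere.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen: float, ref_chosen: float,
             policy_rejected: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard response-level DPO loss for one preference pair.

    Each argument is the log-probability of a full response (CoT plus
    answer) under the policy or the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def reasoning_conditioned_loss(policy_ans_pos_cot: float, ref_ans_pos_cot: float,
                               policy_ans_neg_cot: float, ref_ans_neg_cot: float,
                               beta: float = 0.1) -> float:
    """Assumed form of a CoT-oriented preference term.

    The same preferred answer is scored twice: once conditioned on the
    positive CoT and once conditioned on the negative CoT. The loss is
    low when the answer is more likely under the positive CoT.
    """
    return dpo_loss(policy_ans_pos_cot, ref_ans_pos_cot,
                    policy_ans_neg_cot, ref_ans_neg_cot, beta)


if __name__ == "__main__":
    # Illustrative log-probabilities only, not measured values.
    print(reasoning_conditioned_loss(-2.0, -3.0, -6.0, -3.5))
```

The only difference between the two functions is what is being scored: the response-level term compares two full outputs, while the reasoning-conditioned term holds the answer fixed and varies the CoT it is conditioned on.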

Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination	POPE Popular	Accuracy 86.8	406
Object Hallucination	POPE Adversarial	Accuracy 86.07	367
Object Hallucination Evaluation	POPE (Random)	Accuracy 89.17	152
Multimodal Hallucination Evaluation	MMHal-Bench	Average Score 4.01	140
Discriminative Hallucination Evaluation	AMBER (test)	Accuracy 86.8	42
Object Hallucination	POPE	F1 Score (Random) 90.07	25
Object Hallucination Evaluation	Object HalBench (test)	CHAIRS Score 24.3	24
Multimodal Understanding	VMCBench	Overall Score 81.55	22
Generative Hallucination Evaluation	AMBER (test)	CHAIR Score 4.4	18
Multimodal Understanding	MME	Perception Total Score 1.61e+3	6

