Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

About

Multimodal Large Reasoning Models adopt an explicit reasoning paradigm and demonstrate strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We show that this formulation performs similarly to answer-only optimization, which suggests that it primarily learns answer-level preferences and leaves CoT-level supervision underexploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, thereby aligning reasoning chains so that they support the answer. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that uses Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
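
The abstract gives no formula for the CoT-oriented term, so the following is only a minimal sketch of one way to read it, not the authors' objective. It assumes the term takes the standard DPO log-sigmoid form, applied to the log-probability of the same preferred answer conditioned on a positive CoT versus a negative CoT. All function and parameter names are hypothetical, and the log-probabilities are plain floats standing in for summed token log-probs from a policy model and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(policy_w: float, policy_l: float,
                    ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one preferred/dispreferred pair of log-probs."""
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -log_sigmoid(margin)


def reasoning_conditioned_loss(policy_ans_given_pos_cot: float,
                               policy_ans_given_neg_cot: float,
                               ref_ans_given_pos_cot: float,
                               ref_ans_given_neg_cot: float,
                               beta: float = 0.1) -> float:
    """ASSUMPTION: contrast the same preferred answer under two CoT conditions.

    Each argument is log p(preferred answer | image, question, CoT), where the
    CoT is either the positive or the negative sample. The DPO-style form is
    an illustrative guess, not taken from the paper.
    """
    return preference_loss(policy_ans_given_pos_cot, policy_ans_given_neg_cot,
                           ref_ans_given_pos_cot, ref_ans_given_neg_cot, beta)


# Policy identical to the reference: zero margin, so the loss equals log(2).
neutral = reasoning_conditioned_loss(-2.0, -2.0, -2.0, -2.0)

# Policy makes the answer more likely under the positive CoT: lower loss.
aligned = reasoning_conditioned_loss(-1.0, -4.0, -2.0, -2.0)

# Policy makes the answer more likely under the negative CoT: higher loss.
misaligned = reasoning_conditioned_loss(-4.0, -1.0, -2.0, -2.0)
```

Under this reading, the loss falls only when the preferred answer becomes relatively more likely given the positive CoT than given the negative one, which is the sense in which the CoT acts as a condition rather than as part of a monolithic response.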

Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination | POPE Popular | F1 Score | 85.69 | 372 |
| Object Hallucination | POPE Adversarial | Accuracy | 86.07 | 353 |
| Object Hallucination Evaluation | POPE (Random) | Accuracy | 89.17 | 152 |
| Multimodal Hallucination Evaluation | MMHal-Bench | Average Score | 4.01 | 129 |
| Object Hallucination Evaluation | Object HalBench (test) | CHAIRS Score | 24.3 | 24 |
| Discriminative Hallucination Evaluation | AMBER (test) | Accuracy | 86.8 | 18 |
| Generative Hallucination Evaluation | AMBER (test) | CHAIR Score | 4.4 | 18 |
| Object Hallucination | POPE | Accuracy (Random) | 90.73 | 12 |
| Multimodal Understanding | VMCBench | Overall Score | 81.55 | 6 |
| Multimodal Understanding | MME | Perception Total Score | 1.61e+3 | 6 |
