Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization
About
Multimodal Large Reasoning Models bring an explicit reasoning paradigm to vision-language tasks and demonstrate strong capabilities on complex ones. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, thereby aligning the model toward reasoning chains that support the answer. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and uses attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
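The contrast between response-level DPO and a reasoning-conditioned term can be illustrated with a small sketch. This is not the paper's exact objective: the abstract does not give the formula, so the code below assumes the CoT-oriented term takes the usual DPO log-sigmoid form, applied to the log-probability of the same preferred answer under a positive versus a negative CoT. All function and parameter names, and the default `beta`, are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def response_level_dpo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a whole output
    (CoT and answer treated as one sequence) under the policy (pol_*)
    or the frozen reference model (ref_*); w = preferred, l = rejected.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def reasoning_conditioned_term(
    pol_ans_pos_cot, ref_ans_pos_cot,
    pol_ans_neg_cot, ref_ans_neg_cot,
    beta=0.1,
):
    """ASSUMED form of a CoT-oriented preference term (illustrative only).

    The answer is the SAME preferred answer in both branches; only the
    CoT it is conditioned on differs. Inputs are log p(answer | image,
    question, CoT) under the policy and the reference model.
    """
    pos = pol_ans_pos_cot - ref_ans_pos_cot  # implicit reward given the positive CoT
    neg = pol_ans_neg_cot - ref_ans_neg_cot  # implicit reward given the negative CoT
    return -log_sigmoid(beta * (pos - neg))


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    print(response_level_dpo_loss(-12.0, -13.0, -15.0, -14.0))
    print(reasoning_conditioned_term(-2.0, -3.0, -5.0, -3.0))
```

Under this assumed form, the term decreases when the policy makes the preferred answer more likely after the positive CoT than after the negative one, relative to the reference model, which is one way to read "contrasts the preference for the same preferred answer under different CoT conditions."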
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination | POPE Popular | F1 Score | 85.69 | 372 |
| Object Hallucination | POPE Adversarial | Accuracy | 86.07 | 353 |
| Object Hallucination Evaluation | POPE (Random) | Accuracy | 89.17 | 152 |
| Multimodal Hallucination Evaluation | MMHal-Bench | Average Score | 4.01 | 129 |
| Object Hallucination Evaluation | Object HalBench (test) | CHAIRS Score | 24.3 | 24 |
| Discriminative Hallucination Evaluation | AMBER (test) | Accuracy | 86.8 | 18 |
| Generative Hallucination Evaluation | AMBER (test) | CHAIR Score | 4.4 | 18 |
| Object Hallucination | POPE | Accuracy (Random) | 90.73 | 12 |
| Multimodal Understanding | VMCBench | Overall Score | 81.55 | 6 |
| Multimodal Understanding | MME | Perception Total Score | 1.61e+3 | 6 |