MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
About
Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
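The abstract does not give the objective itself, so the following is only a minimal sketch of how such a loss might be assembled, not the authors' formulation. The standard DPO term is well established. The three extra terms (an invariance penalty, a sensitivity term, and a language-prior penalty), their functional forms, the weights `lam_inv`, `lam_sens`, `lam_prior`, and all argument names are assumptions chosen for illustration. Inputs are sequence log-probabilities of a response under different versions of the audiovisual input.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_* are policy log-probabilities, ref_logp_* are reference-model
    log-probabilities; w = chosen response, l = rejected response.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def modality_decoupled_loss_sketch(
    logp_w, logp_l, ref_logp_w, ref_logp_l,
    logp_w_irrelevant_corrupted,  # chosen response, irrelevant modality corrupted
    logp_w_relevant_corrupted,    # chosen response, relevant modality corrupted
    logp_text_only,               # response produced from the text prompt alone
    beta=0.1, lam_inv=1.0, lam_sens=1.0, lam_prior=1.0,
):
    """Hypothetical combination of DPO with modality-aware terms.

    All three regularizers below are illustrative assumptions.
    """
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)

    # Invariance (assumed form): corrupting the irrelevant modality
    # should leave the chosen response's log-probability unchanged.
    invariance = (logp_w - logp_w_irrelevant_corrupted) ** 2

    # Sensitivity (assumed form): corrupting the relevant modality
    # should make the chosen response less likely than with clean input.
    sensitivity = -log_sigmoid(beta * (logp_w - logp_w_relevant_corrupted))

    # Language-prior penalty (assumed form): the grounded response should
    # be preferred over a response driven by the text prompt alone.
    prior = -log_sigmoid(beta * (logp_w - logp_text_only))

    return base + lam_inv * invariance + lam_sens * sensitivity + lam_prior * prior
```

With all three weights set to zero the sketch reduces to plain DPO, which makes it easy to ablate each assumed term separately.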
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | -- | -- | 425 |
| Audio-visual understanding | DailyOmni | Average Score | 53.82 | 69 |
| Video-driven Audio Hallucination | AVHBench | Accuracy | 83.4 | 27 |
| Cross-modal hallucination evaluation | AVHBench | Overall Accuracy | 88.19 | 22 |
| Audiovisual Matching | AVHBench | Accuracy | 69.68 | 14 |
| Cross-modal Hallucination Detection | Curse of Multi-Modalities (CMM) 1.0 (test) | VL Precision (pa) | 92.5 | 14 |
| Audio Understanding | MMAU audio | Sound Score | 72.08 | 10 |
| Multi-turn omni-modal dialog | OmniDialog audiovisual task | OmniDialog Score | 85.86 | 4 |