
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

About

While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.
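The two objectives described above can be illustrated with a minimal sketch. It assumes a DPO-style log-sigmoid preference loss, which the abstract does not confirm; the function names, the `beta` temperature, and the `lam` weighting are all hypothetical and chosen only for illustration. The output-contrastive term compares two responses under the same input, while the input-contrastive term compares the same response under the true audio versus a swapped audio track.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_term(pol_chosen, ref_chosen, pol_rejected, ref_rejected, beta=0.1):
    # DPO-style term (an assumption, not the paper's confirmed loss):
    # reward the policy for raising the chosen log-prob relative to the
    # reference model more than it raises the rejected log-prob.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

def dual_axis_loss(output_pair, input_pair, beta=0.1, lam=1.0):
    # output_pair: log-probs for (audio-grounded response, visually driven
    #   response), both conditioned on the true video and audio.
    # input_pair: log-probs of the same audio-grounded response conditioned
    #   on (true audio, swapped audio).
    # Each pair is (pol_chosen, ref_chosen, pol_rejected, ref_rejected).
    # lam is a hypothetical weight balancing the two axes.
    out_loss = preference_term(*output_pair, beta=beta)
    in_loss = preference_term(*input_pair, beta=beta)
    return out_loss + lam * in_loss

# Toy sequence log-probabilities; the values are illustrative only.
loss = dual_axis_loss(
    output_pair=(-1.0, -2.0, -3.0, -2.0),
    input_pair=(-1.0, -2.0, -2.5, -2.0),
)
```

Under this sketch, a model whose output does not change when the audio track is swapped gets a zero margin on the input-contrastive term and therefore no reduction in that part of the loss, which is one way to penalize invariance to the auditory signal.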

Ami Baid, Zihui Xue, Kristen Grauman • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video-driven Audio Hallucination | AVHBench | Accuracy | 79.9 | 27 |
| Audio Hallucination Detection | CMM | Audio-Language PA | 85 | 13 |
| Audio-Visual Captioning | AVHBench | METEOR | 17.2 | 5 |
| Vision-Audio-Language (VAL) | CMM | PA Accuracy (Yes Instances) | 79.5 | 5 |
