When Vision Speaks for Sound

About

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear to be audio-grounded but in fact exploit visual-acoustic correlations without verifying that the audio and visual streams are actually aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
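
The three counterfactual audio edits named above can be illustrated on a raw waveform. The following is a minimal sketch, not the authors' implementation: it assumes mono audio stored as a 1-D NumPy array, and the function names `shift`, `mute`, and `swap`, the silence padding, and the length-matching rule are all hypothetical choices made for illustration.

```python
import numpy as np


def shift(audio: np.ndarray, sample_rate: int, offset_s: float) -> np.ndarray:
    """Shift: delay (offset_s > 0) or advance (offset_s < 0) the audio track.

    The vacated samples are filled with silence so the track keeps its length
    (an assumption; other padding schemes are possible).
    """
    n = int(round(abs(offset_s) * sample_rate))
    if n == 0:
        return audio.copy()
    if n >= len(audio):
        return np.zeros_like(audio)
    pad = np.zeros(n, dtype=audio.dtype)
    if offset_s > 0:
        return np.concatenate([pad, audio[:-n]])
    return np.concatenate([audio[n:], pad])


def mute(audio: np.ndarray) -> np.ndarray:
    """Mute: replace the audio track with silence of the same length."""
    return np.zeros_like(audio)


def swap(audio: np.ndarray, other: np.ndarray) -> np.ndarray:
    """Swap: replace the track with audio taken from a different clip.

    The donor audio is trimmed or zero-padded to the original length
    (an assumption) so it still lines up with the video duration.
    """
    out = np.zeros(len(audio), dtype=audio.dtype)
    m = min(len(audio), len(other))
    out[:m] = other[:m]
    return out
```

Under these assumptions, a probe would pair the unchanged video frames with each edited track and ask the model a question whose correct answer depends on the audio, for example whether a sound is present after `mute`.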

Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu, Wendi Li, Yanan Xie, Muhao Chen, Peng Qi • 2026

Related benchmarks

Task	Dataset	Result	
Video Understanding	VideoMME	--	222
Video Understanding	LVB	Accuracy 52.1	101
Audio-visual understanding	WorldSense	Accuracy 50.3	72
Audio-visual understanding	Daily-Omni	Accuracy 69	60
Temporal Grounding	Sync	Accuracy 83.1	11
Temporal Grounding	VGGSync	Accuracy 56.6	10
