CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

About

Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter. Current methods often amplify irrelevant regions or moments, which leads to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning. It uses two complementary modules to mitigate audio-visual misalignment: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE). CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design three lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
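
The abstract names a caption-to-modality InfoNCE objective and an entropy regularizer without giving their exact form. As a rough illustration only, the sketch below shows the generic versions of both terms in plain Python. The function names (`info_nce`, `entropy`), the cosine-similarity scoring, and the temperature of 0.07 are assumptions made for this example and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def info_nce(captions, modality, temperature=0.07):
    """Mean InfoNCE loss; captions[i] and modality[i] are the positive pair.

    Every other modality[j] in the batch acts as a negative for captions[i].
    The temperature value is an assumed default, not the paper's setting.
    """
    losses = []
    for i, caption in enumerate(captions):
        logits = [cosine(caption, m) / temperature for m in modality]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

def entropy(weights):
    """Shannon entropy of non-negative selection weights after normalization."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

In this generic form, the InfoNCE term is small when each caption embedding is closest to its own audio or visual embedding, and the entropy term measures how concentrated a set of token-selection weights is. How CAE-AV weights and combines these terms is not stated in the abstract.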

Yunzuo Hu, Wen Li, Jing Zhang • 2026

Related benchmarks

Task	Dataset	Result
Audio-Visual Video Parsing	LLP (test)	Audio Segment Score 63.8	89
Audio-Visual Question Answering	MUSIC-AVQA (test)	Acc (Avg) 76.7	76
Audio-Visual Event Localization	AVE (test)	Accuracy 82.6	54
Audio-Visual Segmentation	AVSBench S4 (test)	--	21
Audio-Visual Segmentation	AVSBench MS3 setting (test)	MJ Score 55.1	6
