
CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

About

Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter. Current methods often amplify irrelevant regions or moments, which leads to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning. It uses two complementary modules to relieve audio-visual misalignment: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE). CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design three lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
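
The abstract names the objectives but does not give their formulas. As a rough illustration only, the sketch below shows the generic InfoNCE loss and a Shannon entropy term in plain Python. The function names (`info_nce`, `selection_entropy`), the temperature of 0.07, the use of cosine similarity, and the in-batch negatives are all assumptions made for this example, not details taken from CAE-AV.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(captions, features, temperature=0.07):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    Row i of `features` is treated as the positive for caption i;
    every other row in the batch acts as a negative.
    """
    c = [_normalize(v) for v in captions]
    f = [_normalize(v) for v in features]
    total = 0.0
    for i, ci in enumerate(c):
        logits = [sum(a * b for a, b in zip(ci, fj)) / temperature for fj in f]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / len(c)


def selection_entropy(weights, eps=1e-12):
    """Shannon entropy of one token-selection distribution (assumed to sum to 1)."""
    return -sum(p * math.log(max(p, eps)) for p in weights)
```

Under these assumptions, matched caption and modality embeddings yield a lower `info_nce` value than mismatched ones, and `selection_entropy` is zero for a one-hot selection and largest for a uniform one. Whether the paper penalizes or rewards entropy, and how the terms are weighted, is not stated in the abstract.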

Yunzuo Hu, Wen Li, Jing Zhang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg): 76.7 | 59
Audio-Visual Event Localization | AVE (test) | Accuracy: 82.6 | 37
Audio-Visual Segmentation | AVSBench S4 (test) | MJ: 81.9 | 16
Audio-Visual Video Parsing | LLP (test) | Audio Segment Score: 63.8 | 11
Audio-Visual Segmentation | AVSBench MS3 setting (test) | MJ Score: 55.1 | 6
