CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment
About
Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods often amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules to mitigate audio-visual misalignment: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE). CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design three lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
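The abstract names frame-level audio-visual agreement, an InfoNCE objective, and entropy regularization without giving formulas. The sketch below illustrates generic textbook versions of these three ideas in plain Python; it is not the authors' implementation. The function names (`frame_agreement`, `info_nce`, `entropy`), the cosine-based agreement score, and the temperature value of 0.07 are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def frame_agreement(audio_frames, visual_frames):
    """Hypothetical agreement score: cosine similarity of the audio and
    visual feature of each frame, rescaled from [-1, 1] to [0, 1]."""
    return [(cosine(a, v) + 1.0) / 2.0
            for a, v in zip(audio_frames, visual_frames)]

def info_nce(captions, features, temperature=0.07):
    """Generic one-directional InfoNCE: caption i should match feature i,
    with every other feature in the batch acting as a negative."""
    losses = []
    for i, c in enumerate(captions):
        logits = [cosine(c, f) / temperature for f in features]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

def entropy(weights):
    """Shannon entropy of normalized selection weights; a regularizer
    could penalize or reward this value (the sign is an assumption)."""
    return -sum(w * math.log(w) for w in weights if w > 0)

# Toy usage: two frames, two-dimensional features (made-up values).
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, -1.0]]   # second frame disagrees with the audio
agreement = frame_agreement(audio, visual)

captions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(captions, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce(captions, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the first frame receives full agreement and the second none, and the InfoNCE loss is lower when each caption embedding is paired with its matching feature than when the pairs are swapped. How CAE-AV actually computes agreement, selects tokens, and weights these losses is not stated in the abstract.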
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) | 76.7 | 59 |
| Audio-Visual Event Localization | AVE (test) | Accuracy | 82.6 | 37 |
| Audio-Visual Segmentation | AVSBench S4 (test) | MJ | 81.9 | 16 |
| Audio-Visual Video Parsing | LLP (test) | Audio Segment Score | 63.8 | 11 |
| Audio-Visual Segmentation | AVSBench MS3 setting (test) | MJ Score | 55.1 | 6 |