
DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

About

Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio-visual segmentation conditions. Project page: https://trilarflagz.github.io/DDAVS-page/
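The abstract describes anchoring extracted audio semantics to an audio prototype memory bank and refining them with contrastive learning, but does not give the loss. The sketch below illustrates one common way such an objective can be written: an InfoNCE-style loss over cosine similarities between a query embedding and a set of prototypes. The function names, the temperature value, and the toy vectors are all assumptions for illustration and are not taken from DDAVS.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def prototype_contrastive_loss(query, prototypes, positive_index, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not the paper's exact objective).

    Pulls `query` toward prototypes[positive_index] and pushes it away
    from every other prototype in the bank.
    """
    logits = [cosine(query, p) / temperature for p in prototypes]
    # Numerically stable log-sum-exp over all prototypes.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-softmax probability of the positive prototype.
    return log_sum_exp - logits[positive_index]

# Hypothetical 3-class prototype bank and one audio query embedding.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

loss_match = prototype_contrastive_loss(query, prototypes, positive_index=0)
loss_mismatch = prototype_contrastive_loss(query, prototypes, positive_index=1)
print(loss_match < loss_mismatch)
```

The loss is small when the query already lies close to its assigned prototype and large otherwise, which is the behavior a prototype-anchored semantic space relies on.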

Jingqi Tian, Yiheng Du, Haoji Zhang, Yuji Wang, Isaac Ning Lee, Xulong Bai, Tianrui Zhu, Jingxuan Niu, Yansong Tang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio-Visual Segmentation | AVSBench AVS-Objects-S4 | J&F Score | 92.4 | 21 |
| Audio-Visual Segmentation | AVSBench AVS-Objects-MS3 | J&F Score | 75.1 | 21 |
| Audio-Visual Segmentation | VPO-SS 1.0 (test) | J & FB Score | 74.8 | 16 |
| Audio-Visual Segmentation | AVSBench AVS-Semantic | J (Jaccard) | 49.7 | 13 |
| Audio-Visual Segmentation | VPO-MS | J&F Score | 76.11 | 8 |
| Audio-Visual Segmentation | VPO-MSMI | J&F Score | 72.84 | 8 |
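
The benchmark results are reported as J (Jaccard index) and F scores. As a rough illustration of what these numbers measure, the sketch below computes both on a pair of binary masks and averages them. It assumes the F score is the pixel-level F-measure with β² = 0.3, a setting commonly used in AVSBench evaluation; the listing itself does not state the exact protocol, and real evaluations typically operate on thresholded prediction maps averaged over a dataset. The helper names and toy masks are hypothetical.

```python
def jaccard(pred, gt):
    """Intersection over union of two binary masks given as lists of 0/1 rows."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0

def f_score(pred, gt, beta_sq=0.3):
    """Pixel-level F-measure; beta_sq = 0.3 is an assumed convention."""
    tp = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    pred_pos = sum(p for row in pred for p in row)
    gt_pos = sum(g for row in gt for g in row)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gt_pos if gt_pos else 0.0
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0

# Toy 2x3 masks: predicted vs. ground-truth sounding-object pixels.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]

j = jaccard(pred_mask, gt_mask)
f = f_score(pred_mask, gt_mask)
j_and_f = (j + f) / 2  # simple mean of the two scores
print(j, f, j_and_f)
```

Here two of the three predicted pixels overlap the ground truth, so J is 2/4 and precision and recall are both 2/3.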
