Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

About

Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes. However, since the AVSS task requires establishing audio-visual correspondence and semantic understanding simultaneously, we observe that previous methods have struggled to handle this mix of objectives in end-to-end training, resulting in insufficient learning and suboptimal results. Therefore, we propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simpler subtasks, progressing from localization to semantic understanding. Each subtask is fully optimized in its own stage, so that global optimization is reached step by step. This training strategy also generalizes to existing methods and improves their performance. To further improve performance on AVS tasks, we propose a novel framework, Adaptive Audio Visual Segmentation, in which we incorporate an adaptive audio query generator and integrate masked attention into the transformer decoder, facilitating the adaptive fusion of visual and audio features. Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three AVS benchmarks. The project homepage can be accessed at https://gewu-lab.github.io/stepping_stones/.
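
The abstract mentions integrating masked attention into the transformer decoder but gives no implementation details. As a rough illustration of the general idea only, the sketch below restricts one query's softmax attention to the keys allowed by a boolean mask. The function name `masked_attention_weights`, the list-based inputs, and the fallback for an empty mask are assumptions made for this example, not the authors' code.

```python
import math


def masked_attention_weights(scores, keep_mask):
    """Softmax over one query's attention scores, restricted to allowed keys.

    scores:    list of floats, one raw attention logit per key (e.g. per pixel).
    keep_mask: list of bools, True where the query is allowed to attend.
    Both names are hypothetical and chosen for this illustration.
    """
    if len(scores) != len(keep_mask):
        raise ValueError("scores and keep_mask must have the same length")
    # Assumed safeguard: if the mask excludes every key, fall back to
    # unmasked attention so the softmax stays well defined.
    if not any(keep_mask):
        keep_mask = [True] * len(scores)
    kept = [s for s, k in zip(scores, keep_mask) if k]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if k else 0.0 for s, k in zip(scores, keep_mask)]
    total = sum(exps)
    return [e / total for e in exps]


# The last two keys are masked out, so they receive zero weight even though
# the fourth key has the largest raw score.
weights = masked_attention_weights([2.0, 1.0, 0.5, 3.0], [True, True, False, False])
```

In a real decoder the mask would typically come from a mask prediction at an earlier layer, and the computation would be batched over queries and heads. How the mask is derived in this paper is not stated on this page.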

Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu • 2024

Related benchmarks

Task	Dataset	Metric	Result
Audio-Visual Segmentation	AVSBench AVS-Objects-MS3	J & F Score	72.5	21
Audio-Visual Segmentation	AVSBench-Object MS3 (test)	Jaccard Index (J)	67.4	21
Audio-Visual Segmentation	AVSBench AVS-Objects-S4	J&F Score	87.3	21
Audio-Visual Segmentation	AVSBench Object S4 (test)	Jaccard Index (J)	83.2	21
Audio-Visual Segmentation	AVS-Object MS3	J&Fm Combined Score	72.5	19
Audio-Visual Segmentation	AVS-Object S4	J&Fm	87.3	19
Audio-Visual Segmentation	AVSBench-object MS3 v1m (test)	mIoU	67.3	16
Audio-Visual Segmentation	AVSBench-object S4 v1s (test)	mIoU	83.2	16
Audio-Visual Segmentation	VPO-SS 1.0 (test)	J & FB Score	68.54	16
Audio-Visual Semantic Segmentation	AVSBench-Semantic (AVSS) (test)	Jaccard Index (J)	48.5	13

Showing 10 of 20 rows
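
Several rows above report a Jaccard index (J) or mIoU. For readers unfamiliar with the metric, here is a minimal sketch of the Jaccard index (intersection over union) for a single pair of binary masks. The function name, the nested-list mask format, and the convention for two empty masks are assumptions for illustration. The benchmarks' actual evaluation protocols, such as averaging over frames or classes, are not described on this page.

```python
def jaccard_index(pred, target):
    """Intersection over union of two binary masks.

    pred, target: equally sized nested lists of 0/1 values (rows of pixels).
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = jaccard_index(pred, target)  # 2 shared pixels / 4 pixels in the union = 0.5
```

The tabulated results appear to be on a 0 to 100 scale, which would make a score of 0.5 from this function equivalent to 50.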
