
CPM: Class-conditional Prompting Machine for Audio-visual Segmentation

About

Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Transformer-based segmentation architectures meet this requirement naturally, because they capture long-range dependencies and handle different modalities flexibly. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves bipartite matching with a learning strategy that combines class-agnostic queries with class-conditional queries. It improves the efficacy of cross-modal attention with new learning objectives for the audio, visual, and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.
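
For readers unfamiliar with the bipartite matching mentioned in the abstract, the following is a minimal sketch of the general idea: each ground-truth target is assigned to a distinct query so that the total matching cost is minimal. It is not CPM's implementation. The function name `match_queries`, the cost values, and the brute-force search are assumptions made for illustration; practical systems typically use the Hungarian algorithm instead of enumerating assignments.

```python
from itertools import permutations


def match_queries(cost):
    """Find the minimum-cost one-to-one assignment of targets to queries.

    cost[q][t] is a hypothetical matching cost between query q and target t.
    Assumes there are at least as many queries as targets.
    Brute force: only suitable for very small examples.
    """
    n_queries = len(cost)
    n_targets = len(cost[0])
    best_total = float("inf")
    best_perm = None
    # Each permutation picks one distinct query for every target.
    for perm in permutations(range(n_queries), n_targets):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    pairs = [(q, t) for t, q in enumerate(best_perm)]
    return pairs, best_total


# Illustrative costs: 3 queries (rows), 2 targets (columns).
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
pairs, total = match_queries(cost)
print(pairs, round(total, 3))
```

In this toy example query 1 is matched to target 0 and query 0 to target 1, while query 2 stays unmatched. Small changes in the costs between training steps can flip such assignments, which is one way to picture the instability the abstract refers to.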

Yuanhong Chen, Chong Wang, Yuyuan Liu, Hu Wang, Gustavo Carneiro • 2024

Related benchmarks

Task                                Dataset                   Metric        Result  Rank
Audio-Visual Segmentation           AVSBench S4 v1 (test)     MJ            81.4    55
Audio-Visual Segmentation           AVSBench MS3 v1 (test)    Mean Jaccard  59.8    37
Audio-Visual Semantic Segmentation  AVSBench AVSS v1 (test)   MJ            34.5    29
Audio-Visual Segmentation           VPO-SS 1.0 (test)         J & FB Score  73.49   16
Audio-Visual Segmentation           VPO-MSMI 1.0 (test)       J & FB Score  68.07   8
Audio-Visual Segmentation           VPO-MS 1.0 (test)         J & FB Score  72.91   8
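
The Mean Jaccard (MJ) metric listed for the AVSBench rows is based on the Jaccard index, i.e. intersection over union between a predicted mask and a ground-truth mask. The sketch below shows how such a score could be computed on flattened binary masks. The function names, the toy masks, and the convention of scoring two empty masks as 1.0 are assumptions for illustration; the benchmarks' official evaluation code may differ in details such as averaging order and reporting as a percentage.

```python
def jaccard(pred, target):
    """Jaccard index (IoU) between two flattened binary masks of 0/1 values."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_jaccard(pairs):
    """Average the Jaccard index over (pred, target) mask pairs."""
    scores = [jaccard(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


# Toy masks, each flattened to a list of pixels.
samples = [
    ([1, 1, 0, 0], [1, 0, 1, 0]),  # overlap 1, union 3
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # identical masks
]
print(round(mean_jaccard(samples), 4))
```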
