Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

About

The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, we observe that previous AVS methods rely heavily on detrimental segmentation preferences for audible objects rather than on precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, yielding representative sounding object features. These features not only encompass audio cues but also carry vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference

Yaoting Wang, Peiwen Sun, Yuanchao Li, Honggang Zhang, Di Hu • 2024

Related benchmarks

Task | Dataset | Result | Rank
Audio-Visual Segmentation | AVSBench S4 v1 (test) | MJ 83.3 | 55
Audio-Visual Segmentation | AVSBench MS3 v1 (test) | Mean Jaccard 66 | 37
Audio-Visual Semantic Segmentation | AVSBench AVSS v1 (test) | MJ 39 | 29
Audio-Visual Segmentation | AVSBench-object MS3 v1m (test) | mIoU 66 | 16
Audio-Visual Segmentation | AVSBench-object S4 v1s (test) | mIoU 83.2 | 16
Audio-Visual Semantic Segmentation | AVSBench Semantic (test) | mIoU 38.9 | 8
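
For readers unfamiliar with the metrics listed above: the Jaccard index (also called intersection over union, IoU) of a predicted mask and a ground-truth mask is the number of pixels in both masks divided by the number of pixels in either. The following is a minimal sketch of that definition, not the benchmarks' official evaluation code. The function names `jaccard` and `mean_jaccard`, the nested-list mask format, and the handling of two empty masks are assumptions made for illustration. How each benchmark averages (over frames, videos, or classes) is not stated on this page.

```python
def jaccard(pred, gt):
    """Jaccard index (IoU) of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0   # pixel set in both masks
            union += 1 if (p or g) else 0    # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(preds, gts):
    """Unweighted mean of per-mask Jaccard scores (averaging scheme is an assumption)."""
    scores = [jaccard(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(jaccard(pred, gt))  # one shared pixel out of two covered pixels
```

Scores computed this way lie in [0, 1]; the table's values appear to be the same quantity expressed as percentages.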
