Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation
About
Audio-visual segmentation (AVS) aims to segment the sounding objects in each frame of a given video. Distinguishing sounding objects from silent ones requires both audio-visual semantic correspondence and temporal interaction. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and the visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information and associate each of them with particular sounding objects. Explicit object-level semantic correspondence between the audio and visual modalities is established by gathering object information from visual features with the predefined audio queries. In addition, an Audio-Bridged Temporal Interaction module is proposed to exchange sounding-object-relevant information among multiple frames, using audio features as the bridge. Extensive experiments on two AVS benchmarks show that our method achieves state-of-the-art performance, including gains of 7.1% M_J and 7.6% M_F on the MS3 setting.
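The core idea above (object queries conditioned on audio, gathering object information from visual features via attention) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dimensions, the additive audio conditioning, and single-head attention are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # queries: (Q, D); keys/values: (N, D).
    # Each query gathers a weighted sum of visual tokens.
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return attn @ values

# Hypothetical sizes: 4 object queries, 16-dim features, 8x8 visual grid.
num_queries, d = 4, 16
rng = np.random.default_rng(0)

learned_queries = rng.normal(size=(num_queries, d))  # learnable object queries
audio_feat = rng.normal(size=(d,))                   # per-frame audio embedding
# Condition the queries on audio (additive conditioning is an assumption here).
audio_queries = learned_queries + audio_feat

visual_feat = rng.normal(size=(64, d))               # flattened H*W visual tokens
# Each audio query attends over visual tokens to collect object-level features,
# which a mask head would then turn into per-object segmentation masks.
obj_feats = cross_attention(audio_queries, visual_feat, visual_feat)
print(obj_feats.shape)  # (4, 16): one feature vector per sounding-object query
```

In the actual model this attention would be stacked in transformer decoder layers and followed by a mask prediction head; the sketch only shows how audio-conditioned queries establish object-level correspondence with visual features.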
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Segmentation | AVSBench S4 v1 (test) | M_J | 81.6 | 55 |
| Audio-Visual Segmentation | AVSBench MS3 v1 (test) | Mean Jaccard | 61.1 | 37 |
| Audio-Visual Segmentation | AVSBench AVS-Objects-MS3 | J&F | 67.5 | 21 |
| Audio-Visual Segmentation | AVSBench AVS-Objects-S4 | J&F | 85.5 | 21 |
| Audio-Visual Segmentation | AVS-Object MS3 | J&F | 67.5 | 19 |
| Audio-Visual Segmentation | AVS-Object S4 | J&F | 85.5 | 19 |
| Audio-Visual Segmentation | AVSBench-object S4 v1s (test) | mIoU | 81.6 | 16 |
| Audio-Visual Segmentation | AVSBench-object MS3 v1m (test) | mIoU | 61.1 | 16 |