Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

About

The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in video frames using audio cues. However, current fusion-based methods have performance limitations due to the small receptive field of convolution and inadequate fusion of audio-visual features. To overcome these issues, we propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task. Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on segmenting the sounding objects pinpointed by the audio signal, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and demonstrates better generalization ability in multi-sound and open-set scenarios.

Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, Ya Zhang • 2023

Related benchmarks

Task	Dataset	Result
Audio-Visual Segmentation	AVSBench AVS-Objects-MS3	J & F Score 72	21
Audio-Visual Segmentation	AVSBench AVS-Objects-S4	J&F Score 82.1	21
Audio-Visual Segmentation	AVS-Object MS3	J&Fm Combined Score 72	19
Audio-Visual Segmentation	AVS-Object S4	J&Fm 82.1	19
Audio-Visual Segmentation	AVSBench-object S4 v1s (test)	mIoU 80.4	16
Audio-Visual Segmentation	AVSBench-object MS3 v1m (test)	mIoU 56.2	16
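
The metrics in the table are mask-overlap scores. As a rough illustration of how such scores are typically computed for a single predicted mask, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names, the nested-list mask format, the handling of empty masks, the default beta2 = 0.3, and the reading of "J&F" as the plain average of the two scores are all assumptions made for this example.

```python
def jaccard(pred, gt):
    """Jaccard index (IoU) of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def f_score(pred, gt, beta2=0.3):
    """Weighted F-score of a binary mask; beta2 = 0.3 is an assumed default."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    # Assumption: no true positives yields a score of 0.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


def j_and_f(pred, gt):
    """Average of the two scores (one assumed reading of a combined J&F score)."""
    return (jaccard(pred, gt) + f_score(pred, gt)) / 2


# Toy 2x2 example: one pixel overlaps, one is a false positive, one is missed.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [1, 0]]
print(jaccard(pred_mask, gt_mask), f_score(pred_mask, gt_mask))
```

Benchmark numbers such as those above are averages of per-frame scores over a test set and are reported as percentages, so a single-mask value from this sketch is not directly comparable to a table entry.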

