
Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

About

The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in video frames using audio cues. However, current fusion-based methods have performance limitations due to the small receptive field of convolution and inadequate fusion of audio-visual features. To overcome these issues, we propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task. Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on segmenting the pinpointed sounding objects based on audio signals, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and demonstrates better generalization ability in multi-sound and open-set scenarios.
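The abstract does not spell out how the decoder queries are made audio-aware, so the following is only a toy sketch of the general idea, not the AuTR implementation. It assumes (hypothetically) that an audio embedding is added to each base query, and that the resulting queries attend over flattened visual tokens with single-head scaled dot-product cross-attention and no learned projections. The names `audio_aware_queries` and `cross_attention` are illustrative and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def audio_aware_queries(base_queries, audio_embedding):
    """Assumption: condition each query on audio by element-wise addition."""
    return [[q + a for q, a in zip(query, audio_embedding)]
            for query in base_queries]


def cross_attention(queries, visual_tokens):
    """Each query gathers a weighted average of the visual tokens."""
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in queries:
        weights = softmax([dot(query, tok) * scale for tok in visual_tokens])
        outputs.append([
            sum(w * tok[d] for w, tok in zip(weights, visual_tokens))
            for d in range(dim)
        ])
    return outputs


# Two visual tokens; the audio embedding points toward the first one.
visual_tokens = [[4.0, 0.0], [0.0, 4.0]]
queries = audio_aware_queries([[0.0, 0.0]], [1.0, 0.0])
attended = cross_attention(queries, visual_tokens)
```

In this toy setup the audio-conditioned query puts most of its attention weight on the visual token aligned with the audio embedding, which is the intuition behind focusing on sounding objects rather than silent but salient ones.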

Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, Ya Zhang • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Audio-Visual Segmentation | AVSBench AVS-Objects-MS3 | J & F Score | 72 | 21
Audio-Visual Segmentation | AVSBench AVS-Objects-S4 | J&F Score | 82.1 | 21
Audio-Visual Segmentation | AVS-Object MS3 | J&Fm Combined Score | 72 | 19
Audio-Visual Segmentation | AVS-Object S4 | J&Fm | 82.1 | 19
Audio-Visual Segmentation | AVSBench-object S4 v1s (test) | mIoU | 80.4 | 16
Audio-Visual Segmentation | AVSBench-object MS3 v1m (test) | mIoU | 56.2 | 16
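
For readers unfamiliar with the metrics above: in the AVS literature, J usually denotes the Jaccard index (intersection over union of predicted and ground-truth masks, averaged over frames as mIoU), and F denotes an F-score computed from pixel-level precision and recall. This page does not define how its combined J&F entries are computed, so the sketch below only illustrates the two per-mask quantities. The weighting `beta_sq=0.3` is an assumption based on common AVSBench practice, and the function names are illustrative.

```python
def jaccard_index(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def f_score(pred, target, beta_sq=0.3):
    """Weighted F-score from pixel-level precision and recall.

    beta_sq=0.3 is an assumed default, not a value stated on this page.
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
j = jaccard_index(pred_mask, true_mask)  # 2 shared pixels out of 4 in the union
f = f_score(pred_mask, true_mask)        # precision and recall are both 2/3
```

A dataset-level score would average these per-frame values over all test frames; how the listed benchmarks combine J and F into one number is not specified here.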
