
AVSegFormer: Audio-Visual Segmentation with Transformer

About

The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to the visual features of interest. We also present an audio-visual mixer, which dynamically adjusts visual features by amplifying relevant spatial channels and suppressing irrelevant ones. Additionally, we devise an intermediate mask loss to strengthen the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.

Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu • 2023
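
The channel reweighting idea behind the audio-visual mixer can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it is a simplified gate written under stated assumptions. The function name `mix_audio_visual`, the projection matrix `proj`, and the choice of a sigmoid gate scaled to the range (0, 2) are all hypothetical and chosen only to show how an audio embedding could amplify some visual channels and suppress others.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mix_audio_visual(visual, audio, proj):
    """Scale each visual channel by a factor computed from the audio embedding.

    visual: (C, H, W) visual feature map
    audio:  (D,) audio embedding
    proj:   (C, D) hypothetical projection from audio space to channel space
    """
    # Per-channel scale in (0, 2): above 1 amplifies, below 1 suppresses,
    # and exactly 1 (zero audio response) leaves the channel unchanged.
    scale = 2.0 * sigmoid(proj @ audio)  # shape (C,)
    # Broadcast the per-channel scale over the spatial dimensions.
    return visual * scale[:, None, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.standard_normal((4, 2, 2))
    audio = rng.standard_normal(3)
    proj = rng.standard_normal((4, 3))
    mixed = mix_audio_visual(visual, audio, proj)
    print(mixed.shape)  # (4, 2, 2)
```

In this sketch the gate depends only on the audio embedding; an attention-based mixer would typically also condition the weights on the visual features themselves.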

Related benchmarks

Task                                Dataset                          Metric                Result  Rank
Audio-Visual Segmentation           AVSBench S4 v1 (test)            MJ                    82.1    55
Audio-Visual Segmentation           AVSBench MS3 v1 (test)           Mean Jaccard          58.4    37
Audio-Visual Segmentation           AVSBench MS3 (test)              Jaccard Index (IoU)   58.4    30
Audio-Visual Semantic Segmentation  AVSBench AVSS v1 (test)          MJ                    36.7    29
Sound Target Segmentation           AVSBench-object MS3 1.0 (test)   mIoU                  58.4    23
Audio-Visual Segmentation           AVSBench AVS-Objects-S4          J&F Score             86.8    21
Audio-Visual Segmentation           AVSBench AVS-Objects-MS3         J & F Score           67.2    21
Audio-Visual Segmentation           AVS-Object S4                    J&Fm                  86.8    19
Audio-Visual Segmentation           AVS-Object MS3                   J&Fm Combined Score   67.2    19
Audio-Visual Segmentation           AVSBench-object S4 v1s (test)    mIoU                  82.1    16
Showing 10 of 36 rows
