
CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

About

Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block, which, when stacked, enables capturing audio-visual fine-grained combinatorial-dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring the decoded mask adheres to the sounds. Experimental results confirm our approach's effectiveness, with our framework achieving a new SOTA performance on all three datasets using two backbones. The code is available at https://github.com/aspirinone/CATR.github.io
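The abstract does not say how the audio-constrained queries are built or decoded, so the following is only a generic illustration of query-based mask decoding, not the CATR implementation. It assumes, purely for illustration, that an audio feature is added to an object query, that the query attends over per-pixel video features with scaled dot-product attention, and that mask logits are the dot product between the refined query and each pixel feature. All function names and the toy feature values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def audio_conditioned_query(audio_feat, object_query):
    # Assumption: condition the query by simple addition of the audio feature.
    return [a + q for a, q in zip(audio_feat, object_query)]


def attend(query, pixel_feats):
    """Scaled dot-product attention of one query over all pixel features."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in pixel_feats])
    dim = len(query)
    return [sum(w * p[d] for w, p in zip(weights, pixel_feats)) for d in range(dim)]


def mask_logits(query, pixel_feats):
    """One logit per pixel: similarity between the query and that pixel's feature."""
    return [dot(query, p) for p in pixel_feats]


# Toy example: four pixels with 2-d features; the first two "match" the audio.
pixel_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
audio_feat = [1.0, 0.0]
object_query = [0.0, 0.0]

query = audio_conditioned_query(audio_feat, object_query)
refined = attend(query, pixel_feats)
logits = mask_logits(refined, pixel_feats)
```

In this toy setup the pixels whose features align with the audio feature receive higher logits than the others, which is the intuition behind letting audio steer the decoded mask.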

Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xiao • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Audio-Visual Segmentation | AVSBench S4 v1 (test) | MJ 81.4 | 55 |
| Audio-Visual Segmentation | AVSBench MS3 v1 (test) | Mean Jaccard 59 | 37 |
| Audio-Visual Segmentation | AVSBench MS3 (test) | Jaccard Index (IoU) 52.8 | 30 |
| Audio-Visual Semantic Segmentation | AVSBench AVSS v1 (test) | MJ 32.8 | 29 |
| Sound Target Segmentation | AVSBench-object MS3 1.0 (test) | mIoU 59 | 23 |
| Audio-Visual Segmentation | AVSBench AVS-Objects-S4 | J&F Score 87.9 | 21 |
| Audio-Visual Segmentation | AVSBench AVS-Objects-MS3 | J & F Score 68.6 | 21 |
| Audio-Visual Segmentation | AVS-Object S4 | J&Fm 87.9 | 19 |
| Audio-Visual Segmentation | AVS-Object MS3 | J&Fm Combined Score 68.6 | 19 |
| Audio-Visual Segmentation | AVSBench-object S4 v1s (test) | mIoU 81.4 | 16 |
Showing 10 of 17 rows
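
Several results above are reported as a Jaccard index (also written IoU, MJ, or mIoU). As a reminder of what that metric measures, here is a minimal sketch that computes the Jaccard index between a predicted and a ground-truth binary mask and averages it over frames. The function names are hypothetical, and scoring two empty masks as 1.0 is an assumed convention; the benchmark's official evaluation code may differ in such details. The values in the table appear to be on a 0-100 scale, while this sketch returns a value in [0, 1].

```python
def jaccard_index(pred, target):
    """Intersection over union of two binary masks given as lists of rows."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0
    return intersection / union


def mean_jaccard(preds, targets):
    """Average the per-frame Jaccard index over a sequence of mask pairs."""
    scores = [jaccard_index(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Two toy 2x2 frames: the first overlaps on 1 of 2 pixels, the second matches exactly.
pred_frames = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
true_frames = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]

per_frame = [jaccard_index(p, t) for p, t in zip(pred_frames, true_frames)]
average = mean_jaccard(pred_frames, true_frames)
```

For the toy frames, the per-frame scores are 0.5 and 1.0, giving a mean of 0.75.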
