Audio-Visual Segmentation
About
We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
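The idea of injecting audio semantics into visual features pixel by pixel can be illustrated with a small attention sketch. This is not the authors' module: the function name `audio_guided_fusion`, the tensor shapes, and the use of plain scaled dot-product attention without learned projections are all assumptions made for illustration. Each pixel attends to the audio embeddings of every frame, and the resulting audio context is added back to the visual feature as a residual.

```python
import numpy as np

def audio_guided_fusion(visual, audio):
    """Hypothetical sketch of pixel-wise audio-visual attention.

    visual: array of shape (T, H, W, C), per-frame visual feature maps
    audio:  array of shape (T, C), one audio embedding per frame
    Returns fused features of shape (T, H, W, C).
    """
    T, H, W, C = visual.shape
    pixels = visual.reshape(T * H * W, C)   # flatten all pixels of all frames

    # Similarity of every pixel to every frame's audio embedding: (T*H*W, T)
    scores = pixels @ audio.T / np.sqrt(C)

    # Numerically stable softmax over the temporal (audio) axis
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Audio context per pixel, added as a residual to the visual features
    audio_context = weights @ audio          # (T*H*W, C)
    return visual + audio_context.reshape(T, H, W, C)
```

A real model would typically learn linear projections for the two modalities before computing the similarity; they are left out here to keep the data flow visible.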
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Segmentation | AVSBench S4 v1 (test) | MJ | 78.7 | 55 |
| Audio-Visual Segmentation | AVSBench MS3 v1 (test) | Mean Jaccard | 54 | 37 |
| Audio-Visual Segmentation | AVSBench MS3 (test) | Jaccard Index (IoU) | 54 | 30 |
| Audio-Visual Semantic Segmentation | AVSBench AVSS v1 (test) | MJ | 29.8 | 29 |
| Audio-Visual Segmentation | AVSBench AVS-Objects-S4 | J&F Score | 83.3 | 21 |
| Audio-Visual Segmentation | AVSBench AVS-Objects-MS3 | J & F Score | 59.3 | 21 |
| Audio-Visual Segmentation | AVS-Object S4 | J&Fm | 83.3 | 19 |
| Audio-Visual Segmentation | AVS-Object MS3 | J&Fm Combined Score | 59.3 | 19 |
| Audio-Visual Segmentation | AVSBench S4 (test) | MJ | 78.7 | 16 |
| Audio-Visual Segmentation | VPO-SS 1.0 (test) | J & FB Score | 44.63 | 16 |
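
Several of the results above are reported as a Jaccard index (intersection over union), averaged over frames. A minimal sketch of that computation on binary masks follows. The function names are hypothetical, and returning 1.0 when both masks are empty is an assumed convention that may differ from the official AVSBench evaluation code.

```python
import numpy as np

def jaccard_index(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)

def mean_jaccard(preds, targets):
    """Average the per-frame Jaccard index over paired masks."""
    return float(np.mean([jaccard_index(p, t) for p, t in zip(preds, targets)]))
```

For example, a prediction covering two pixels, one of which lies inside a one-pixel ground-truth mask, has an intersection of 1 and a union of 2, so its Jaccard index is 0.5.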