Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching
About
Discriminatively localizing sounding objects in cocktail-party scenes, i.e., scenes with mixed sounds, is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we propose to learn robust object representations by aggregating the candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in the cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are selected by matching audio and visual object category distributions, where the audiovisual consistency serves as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
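The matching step described in the abstract can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation: it assumes class-aware localization maps of shape (K, H, W), pools them into a visual category distribution, and measures agreement with an audio category distribution using KL divergence. All function names, the pooling choice, and the selection threshold are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def visual_category_distribution(class_maps):
    # class_maps: (K, H, W) class-aware localization maps (assumed layout).
    # Average each map spatially, then normalize across the K classes.
    pooled = class_maps.reshape(class_maps.shape[0], -1).mean(axis=1)
    return softmax(pooled)

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q) between two category distributions; used here as an
    # assumed measure of audiovisual consistency (lower means more consistent).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def select_sounding_classes(audio_dist, threshold):
    # Hypothetical selection rule: keep classes whose audio probability
    # reaches the threshold, treating the rest as silent objects.
    return [k for k, prob in enumerate(audio_dist) if prob >= threshold]
```

In a training loop, the divergence between the audio and visual distributions would act as the self-supervised loss, so no category labels are needed for the mixed-sound scenes.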
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-sound source localization | MUSIC-Duet (test) | CIoU@0.3 | 38.8 | 23 |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 | 36.9 | 23 |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5 | 46.8 | 23 |
| Sound Localization | MUSIC-Solo 1.0 (test) | IoU@0.5 | 62.7 | 22 |
| Visual Sound Source Localization | VGG-SS (test) | LocAcc | 29.91 | 19 |
| Visual Sound Source Localization | Flickr SoundNet (test) | LocAcc | 74 | 18 |
| Multi-source sound localization | VGGSound Instruments (test) | CIoU@0.1 | 85.9 | 13 |
| Single-source sound localization | VGGSound Instruments (test) | IoU@0.3 | 61.6 | 13 |
| Visual Sound Source Localization | Flickr-SoundNet extended (test) | LocAcc | 72.91 | 11 |
| Visual Sound Source Localization | VGG-SS extended (test) | Localization Accuracy | 26.87 | 11 |
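
Thresholded IoU metrics such as those in the table are commonly read as a success rate: the share of test samples whose predicted region overlaps the ground truth with an IoU at or above the stated threshold. The sketch below assumes that reading and assumes binary masks as input. The function names and the binarization level are hypothetical, and the class-aware variant (CIoU) is not covered.

```python
import numpy as np

def binarize(heatmap, level):
    # Turn a localization heatmap into a binary mask (assumed preprocessing).
    return np.asarray(heatmap) >= level

def mask_iou(pred_mask, gt_mask):
    # Intersection over union of two boolean masks of equal shape.
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0  # both masks empty: treated as no overlap in this sketch
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)

def success_rate(ious, threshold):
    # Fraction of samples whose IoU reaches the threshold, e.g. 0.5 for IoU@0.5.
    ious = np.asarray(ious, dtype=float)
    if ious.size == 0:
        return 0.0
    return float(np.mean(ious >= threshold))
```

Under this reading, an IoU@0.5 entry would correspond to `100 * success_rate(ious, 0.5)` over the per-sample IoU values of the test set.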