Localizing Visual Sounds the Easy Way
About
Unsupervised audio-visual source localization aims at localizing visible sound sources in a video without relying on ground-truth localization for training. Previous works often seek high audio-visual similarities for likely positive (sounding) regions and low similarities for likely negative regions. However, accurately distinguishing between sounding and non-sounding regions is challenging without manual annotations. In this work, we propose a simple yet effective approach for Easy Visual Sound Localization, namely EZ-VSL, that does not rely on the construction of positive and/or negative regions during training. Instead, we align audio and visual spaces by seeking audio-visual representations that are aligned in at least one location of the associated image, while not matching other images at any location. We also introduce a novel object-guided localization scheme at inference time for improved precision. Our simple and effective framework achieves state-of-the-art performance on two popular benchmarks, Flickr SoundNet and VGG-Sound Source. In particular, we improve the CIoU on the Flickr SoundNet test set from 76.80% to 83.94%, and on the VGG-Sound Source dataset from 34.60% to 38.85%. The code is available at https://github.com/stoneMo/EZ-VSL.
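The alignment objective described above ("aligned in at least one location of the associated image, while not matching other images at any location") can be read as a contrastive loss over max-pooled audio-visual similarities. The following is a minimal NumPy sketch under that reading, not the authors' implementation: the function names, the tensor shapes, the symmetric two-direction form, and the temperature value `tau` are all assumptions made for illustration.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def max_pooled_contrastive_loss(audio, visual, tau=0.03):
    """Illustrative loss; names, shapes and tau are assumptions.

    audio:  (B, D)       one embedding per audio clip
    visual: (B, D, H, W) one embedding per image location
    """
    # L2-normalise along the feature axis so dot products are cosine similarities.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)

    # sim[i, j, h, w]: similarity between audio i and location (h, w) of image j.
    sim = np.einsum("id,jdhw->ijhw", a, v)

    # Keep only the best-matching location of each image. A matching pair
    # needs one good location; a mismatched pair must be low everywhere.
    logits = sim.max(axis=(2, 3)) / tau  # shape (B, B)

    # Cross-entropy with the diagonal (true pairs) as targets, in both directions.
    diag = np.diag(logits)
    loss_audio_to_image = np.mean(logsumexp(logits, axis=1) - diag)
    loss_image_to_audio = np.mean(logsumexp(logits, axis=0) - diag)
    return loss_audio_to_image + loss_image_to_audio
```

Because only the maximum over locations enters the loss, no explicit positive or negative regions have to be constructed during training. The un-pooled map `sim[i, i]` for a matching pair is what one would inspect as a localization heat map.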
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Segmentation | AVSBench MS3 v1 (test) | Mean Jaccard | 23.58 | 37 |
| Sound Source Localization | Flickr SoundNet (test) | CIoU | 83.94 | 28 |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5 | 45.1 | 23 |
| Multi-sound source localization | MUSIC-Duet (test) | CIoU@0.3 | 34.3 | 23 |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 | 32.4 | 23 |
| Sound Localization | MUSIC-Solo 1.0 (test) | IoU@0.5 | 58.7 | 22 |
| Visual Sound Source Localization | VGG-SS (test) | LocAcc | 38.85 | 19 |
| Visual Sound Source Localization | Flickr SoundNet (test) | LocAcc | 83.94 | 18 |
| Sound Source Localization | Flickr SoundNet 10k (test) | AP | 84.56 | 17 |
| Audio-visual localization | VGG-SS Open set (Unheard 110) | AP | 38.19 | 14 |