Localizing Visual Sounds the Hard Way

About

The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, differently from Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines.
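The idea of mining hard samples inside the same image and adding them to a contrastive objective can be illustrated with a short NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function name, the soft-threshold values (`eps_pos`, `eps_neg`, `tau`), the `temperature`, and the exact form of the loss are all hypothetical choices made for this example. Regions of an image that correlate strongly with its own audio are treated as positives, regions of the same image that correlate weakly are treated as hard negatives, and other images in the batch serve as ordinary negatives.

```python
import numpy as np


def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def contrastive_loss_with_hard_negatives(audio, visual, eps_pos=0.65, eps_neg=0.4,
                                         tau=0.03, temperature=0.07,
                                         use_hard_negatives=True):
    """Illustrative loss; all defaults are assumptions, not values from the source.

    audio:  (B, C) L2-normalised audio embeddings.
    visual: (B, C, H, W) visual feature maps, L2-normalised over C.
    """
    batch = audio.shape[0]
    # sim[i, j, h, w]: cosine similarity between audio i and location (h, w) of image j.
    sim = np.einsum("ic,jchw->ijhw", audio, visual)
    idx = np.arange(batch)
    own = sim[idx, idx]  # (B, H, W): each audio against its own image

    # Soft masks: high-similarity regions are positives, low-similarity
    # regions of the SAME image are mined as hard negatives.
    pos_mask = _sigmoid((own - eps_pos) / tau)
    neg_mask = 1.0 - _sigmoid((own - eps_neg) / tau)
    pos = (pos_mask * own).sum(axis=(1, 2)) / (pos_mask.sum(axis=(1, 2)) + 1e-8)
    if use_hard_negatives:
        hard_neg = (neg_mask * own).sum(axis=(1, 2)) / (neg_mask.sum(axis=(1, 2)) + 1e-8)
    else:
        hard_neg = np.full(batch, -np.inf)  # contributes exp(-inf) = 0

    # Ordinary negatives: mean similarity of audio i to every OTHER image j.
    cross = sim.mean(axis=(2, 3))
    cross = np.where(np.eye(batch, dtype=bool), -np.inf, cross)

    # InfoNCE-style loss: the positive logit competes with all negatives.
    logits = np.concatenate([pos[:, None], hard_neg[:, None], cross], axis=1) / temperature
    m = logits.max(axis=1, keepdims=True)
    log_sum_exp = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float((log_sum_exp - logits[:, 0]).mean())


# Usage with random, normalised features (shapes are arbitrary for the demo).
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 16))
audio /= np.linalg.norm(audio, axis=1, keepdims=True)
visual = rng.normal(size=(4, 16, 7, 7))
visual /= np.linalg.norm(visual, axis=1, keepdims=True)
print(contrastive_loss_with_hard_negatives(audio, visual))
```

Setting `use_hard_negatives=False` drops the within-image term, which makes it easy to compare the two objectives on the same batch.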

Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman • 2021

Related benchmarks

Task | Dataset | Result | Rank
Audio-Visual Segmentation | AVSBench S4 v1 (test) | MJ 37.9 | 55
Audio-Visual Segmentation | AVSBench MS3 (test) | Jaccard Index (IoU) 29.5 | 30
Single-source sound localization | VGGSound single-source (test) | IoU@0.5 44.6 | 23
Multi-sound source localization | MUSIC-Duet (test) | CIoU@0.3 33.1 | 23
Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 31.8 | 23
Sound Target Segmentation | AVSBench-object MS3 1.0 (test) | mIoU 29.5 | 23
Sound Localization | MUSIC-Solo 1.0 (test) | IoU@0.5 57.1 | 22
Visual Sound Source Localization | VGG-SS (test) | LocAcc 33.36 | 19
Visual Sound Source Localization | Flickr SoundNet (test) | LocAcc 71.6 | 18
Sound Source Localization | Flickr SoundNet 10k (test) | AP 68.92 | 17
Showing 10 of 31 rows
