Mix and Localize: Localizing Sound Sources in Mixtures

About

We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model both to group a sound mixture into individual sources and to associate each source with a visual signal. Our method solves both tasks jointly, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods. Project site: https://hxixixh.github.io/mix-and-localize
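
The return-probability objective described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each separated sound and each image is already reduced to a single embedding vector, uses a dot product as the similarity metric, and picks a temperature of 0.1 arbitrarily. The names `transition_matrix` and `cycle_loss` are hypothetical.

```python
import math

def softmax(row, temperature):
    # Numerically stable softmax over one row of similarity scores.
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def transition_matrix(src, dst, temperature=0.1):
    # Row i holds the probabilities of walking from src node i to each dst node.
    # Similarity here is a plain dot product (an assumption for this sketch).
    return [
        softmax([sum(a * b for a, b in zip(s, d)) for d in dst], temperature)
        for s in src
    ]

def matmul(p, q):
    return [
        [sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
        for i in range(len(p))
    ]

def cycle_loss(audio, visual, temperature=0.1):
    # Walk audio -> visual -> audio and penalize failing to return to the start.
    a2v = transition_matrix(audio, visual, temperature)
    v2a = transition_matrix(visual, audio, temperature)
    round_trip = matmul(a2v, v2a)
    n = len(audio)
    return -sum(math.log(round_trip[i][i]) for i in range(n)) / n

# Toy embeddings: two separated sounds and two images.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_visual = [[0.9, 0.1], [0.1, 0.9]]        # each image matches one sound
uninformative_visual = [[0.5, 0.5], [0.5, 0.5]]  # images carry no source cue

aligned_loss = cycle_loss(audio, aligned_visual)
uninformative_loss = cycle_loss(audio, uninformative_visual)
print(aligned_loss, uninformative_loss)
```

In this toy setting the loss is small when each sound has a distinct best-matching image and rises to log 2 when the images are indistinguishable, which is the signal a learned similarity metric would be trained to reduce. The loss rewards only returning to the starting node, so it needs no labels saying which image belongs to which sound.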

Xixi Hu, Ziyang Chen, Andrew Owens • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-sound source localization | MUSIC-Duet (test) | CIoU@0.3 | 38.1 | 23 |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 | 35.6 | 23 |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5 | 45.6 | 23 |
| Sound Localization | MUSIC-Solo 1.0 (test) | IoU@0.5 | 59.9 | 22 |
| Multi-source sound localization | VGGSound Instruments (test) | CIoU@0.1 | 84.5 | 13 |
| Single-source sound localization | VGGSound Instruments (test) | IoU@0.3 | 59.3 | 13 |
| Sound Source Localization | Flickr-SoundNet | Precision | 55.83 | 10 |
| Sound Source Segmentation | AVSBench | mIoU | 31.69 | 10 |
| Single Sound Source Localization | MUSIC-Solo (test) | IoU@0.5 | 30.5 | 10 |
| Multi-source sound localization | MUSIC-Duet | CIoU@0.3 | 26.5 | 9 |
Showing 10 of 15 rows
