Mix and Localize: Localizing Sound Sources in Mixtures
About
We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model both to group a sound mixture into individual sources and to associate each source with a visual signal. Our method solves both tasks jointly, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods. Project site: https://hxixixh.github.io/mix-and-localize
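
The cross-modal walk described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names (`transition_matrix`, `cycle_walk_loss`), the temperature value, the use of cosine similarity, and the assumption of one embedding per separated sound and per image are all hypothetical choices made for illustration. The sketch turns audio-visual similarities into transition probabilities with a softmax, walks audio → visual → audio, and penalizes a low probability of returning to the starting node.

```python
import numpy as np

def transition_matrix(sim, temperature=0.07):
    """Row-wise softmax over similarities (temperature is an assumed value)."""
    z = sim / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cycle_walk_loss(audio_emb, visual_emb, temperature=0.07):
    """Negative log return probability of an audio -> visual -> audio walk.

    audio_emb:  (k, d) array, one embedding per separated sound (assumed layout)
    visual_emb: (k, d) array, one embedding per image (assumed layout)
    """
    # Cosine similarity stands in for the learned audio-visual metric.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sim = a @ v.T                                   # (k, k) audio-to-visual similarity

    p_av = transition_matrix(sim, temperature)      # step 1: audio -> visual
    p_va = transition_matrix(sim.T, temperature)    # step 2: visual -> audio
    p_return = p_av @ p_va                          # two-step walk, rows sum to 1

    # The walker should end on the node it started from, i.e. the diagonal.
    loss = -np.mean(np.log(np.diag(p_return) + 1e-12))
    return loss, p_return
```

In this sketch, embeddings that pair each sound with a distinct image give a return probability close to 1 and a loss close to 0, while embeddings that carry no pairing information give a uniform walk and a loss of log k.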
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-sound source localization | MUSIC-Duet (test) | CIoU@0.3 | 38.1 | 23 |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 | 35.6 | 23 |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5 | 45.6 | 23 |
| Sound Localization | MUSIC-Solo 1.0 (test) | IoU@0.5 | 59.9 | 22 |
| Multi-source sound localization | VGGSound Instruments (test) | CIoU@0.1 | 84.5 | 13 |
| Single-source sound localization | VGGSound Instruments (test) | IoU@0.3 | 59.3 | 13 |
| Sound Source Localization | Flickr-SoundNet | Precision | 55.83 | 10 |
| Sound Source Segmentation | AVSBench | mIoU | 31.69 | 10 |
| Single Sound Source Localization | MUSIC-Solo (test) | IoU@0.5 | 30.5 | 10 |
| Multi-source sound localization | MUSIC-Duet | CIoU@0.3 | 26.5 | 9 |