# Sound Source Localization is All about Cross-Modal Alignment

## About
Humans can easily perceive the direction of sound sources in a visual scene, an ability termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior work and existing benchmarks do not account for a more important aspect of the problem: cross-modal semantic understanding, which is essential for genuine sound source localization. Cross-modal semantic understanding matters for handling semantically mismatched audio-visual events, e.g., silent objects or off-screen sounds. To account for this, we propose a cross-modal alignment task trained jointly with sound source localization to better learn the interaction between the audio and visual modalities. As a result, we achieve high localization performance together with strong cross-modal semantic understanding. Our method outperforms state-of-the-art approaches in both sound source localization and cross-modal retrieval. Our work suggests that jointly tackling both tasks is necessary for genuine sound source localization.
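The joint objective described above, aligning audio and visual semantics while localizing, is commonly realized with a contrastive loss between paired audio and visual embeddings. The sketch below is a minimal, generic audio-visual InfoNCE-style alignment loss, assuming pooled `(B, D)` embeddings from each modality; the function name, temperature, and exact loss form are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def cross_modal_alignment_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style alignment loss between audio and visual embeddings.

    audio_emb, visual_emb: (B, D) arrays; row i of each is a matching
    audio-visual pair, and every other row in the batch is a negative.
    (Illustrative sketch, not the paper's exact objective.)
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature                    # (B, B) cosine similarities

    def nce(l):
        # Cross-entropy of each row against its diagonal (matching) entry.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Audio -> visual and visual -> audio directions, averaged.
    return 0.5 * (nce(logits) + nce(logits.T))
```

Perfectly aligned embeddings (e.g., identical orthonormal rows) drive the loss toward zero, while mismatched pairs are pushed apart in the shared embedding space.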
## Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Sound Source Localization | Flickr SoundNet (test) | CIoU | 82.4 | 28 |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5 | 50.8 | 23 |
| Multi-sound source localization | MUSIC-Duet (test) | CIoU@0.3 | 38.3 | 23 |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 | 35.8 | 23 |
| Sound Localization | MUSIC-Solo 1.0 (test) | IoU@0.5 | 66.4 | 22 |
| Single-source sound localization | VGGSound Instruments (test) | IoU@0.3 | 65.8 | 13 |
| Multi-source sound localization | VGGSound Instruments (test) | CIoU@0.1 | 84.9 | 13 |
| Audio referred image grounding | VGG-SS (test) | cIoU | 42.64 | 10 |
| Audio referred image grounding | PascalSound (test) | cIoU | 58.34 | 10 |
| Audio referred image grounding | AVSBench (test) | cIoU | 71.57 | 10 |
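The IoU and cIoU numbers above come from thresholding a predicted localization heatmap, comparing it with an annotated ground-truth region, and reporting the fraction of test samples whose IoU clears a threshold (e.g., cIoU@0.3). A simplified sketch, assuming heatmaps in [0, 1] and binary ground-truth masks; actual benchmark implementations (for instance, the consensus-weighted Flickr SoundNet protocol) differ in detail:

```python
import numpy as np

def iou(pred_map, gt_mask, map_threshold=0.5):
    """IoU between a thresholded localization heatmap and a binary GT mask."""
    pred = pred_map >= map_threshold
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def ciou(pred_maps, gt_masks, iou_threshold=0.5):
    """Fraction of samples whose localization IoU exceeds the threshold.

    Simplified stand-in for the benchmark cIoU metrics; per-dataset
    evaluation scripts apply their own map normalization and thresholds.
    """
    scores = [iou(p, g) for p, g in zip(pred_maps, gt_masks)]
    return float(np.mean([s >= iou_threshold for s in scores]))
```

A perfect prediction on every sample yields a score of 1.0 (reported as 100 on the leaderboards above), and a prediction that never overlaps the annotation yields 0.0.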