SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
About
Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Simply replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) does not capture semantic cues well, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between the audio embedding and the context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces the fixed prompt with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, establishing semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench show that SOUPLE improves both localization and segmentation performance.
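The idea of replacing the fixed prompt with image-conditioned context tokens can be illustrated with a minimal forward-pass sketch. Everything specific here is an assumption made for illustration, not the paper's confirmed design: the helper name `build_conditional_prompt`, the additive shift computed by a single linear map, the placement of [V_A] in the final slot (mirroring "a photo of a [V_A]"), and all dimensions. Plain Python lists stand in for tensors; in practice the context vectors and the linear map would be trainable parameters in an autodiff framework.

```python
import random

def build_conditional_prompt(context, audio_token, visual_feat, weight, bias):
    """Return the token sequence [ctx_1 + shift, ..., ctx_M + shift, V_A].

    context:     M context vectors of dimension D (learnable in practice)
    audio_token: the audio-embedded token [V_A], dimension D
    visual_feat: a pooled visual feature, dimension F
    weight/bias: hypothetical linear map (D x F, D) that turns the visual
                 feature into a shift added to every context vector
    """
    # Project the visual feature to the token dimension: shift = W v + b
    shift = [
        sum(w * v for w, v in zip(row, visual_feat)) + b
        for row, b in zip(weight, bias)
    ]
    # Condition each context token on the image by adding the shift
    conditioned = [[c + s for c, s in zip(tok, shift)] for tok in context]
    # Assumption: the audio token occupies the class-word slot at the end
    return conditioned + [list(audio_token)]

rng = random.Random(0)
M, D, F = 4, 8, 6  # illustrative sizes only
context = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(M)]
weight = [[rng.gauss(0, 0.02) for _ in range(F)] for _ in range(D)]
bias = [0.0] * D
audio_token = [rng.random() for _ in range(D)]
visual_feat = [rng.random() for _ in range(F)]

prompt = build_conditional_prompt(context, audio_token, visual_feat, weight, bias)
print(len(prompt), len(prompt[0]))
```

Under these assumptions the resulting sequence has one more token than the context and would be passed to the text encoder, whose output then conditions the mask decoder.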
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Sound Source Localization | Flickr SoundNet (test) | CIoU | 84.8 | 49 |
| Audio-Visual Segmentation | AVSBench MS3 (test) | Jaccard Index (IoU) | 40.19 | 38 |
| Sound Source Localization | VGG-SS (test) | CIoU | 54.76 | 30 |
| Audio-Visual Segmentation | AVSBench S4 (test) | F | 74 | 21 |
| Visual Sound Source Localization | VGG-SS extended (test) | Localization Accuracy | 54.76 | 20 |
| Visual Sound Source Localization | Flickr-SoundNet extended (test) | LocAcc | 84.8 | 20 |
| Audio-visual localization | VGGSound (Heard 110 categories) | cIoU | 54.85 | 11 |
| Audio-visual localization | VGGSound (Unheard 110 categories) | cIoU | 48.4 | 11 |
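For readers unfamiliar with the overlap metrics in the table, the following is a minimal sketch of the Jaccard index (IoU) between two binary masks, plus a thresholded success rate of the kind localization benchmarks commonly report. It is not the benchmarks' official evaluation code: cIoU on the localization datasets is computed against consensus annotations, and the 0.5 threshold, the function names, and the toy masks here are assumptions for illustration.

```python
def jaccard_index(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; scoring that case as 1.0 is a convention
    return inter / union if union else 1.0

def success_rate(ious, threshold=0.5):
    """Fraction of samples whose IoU strictly exceeds the threshold."""
    return sum(iou > threshold for iou in ious) / len(ious)

# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

iou = jaccard_index(pred, target)
print(iou)  # 0.5
```

Averaging `jaccard_index` over a test set gives a mean IoU comparable in spirit to the segmentation rows, while `success_rate` over per-sample scores corresponds to the thresholded accuracy style of the localization rows.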