SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts

About

Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
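
The prompt construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract only states that learnable context tokens incorporate visual features and replace the fixed prompt "a photo of a [V_A]". Everything else here is an assumption: the dimensions, the single linear `meta_weights` projection, the additive conditioning, and the name `build_prompt`.

```python
import random


def build_prompt(context_tokens, visual_feature, meta_weights, audio_token):
    """Return a token sequence: image-conditioned context tokens, then [V_A].

    The additive conditioning and all shapes are illustrative assumptions.
    """
    # Hypothetical meta-net: one linear map from the visual feature to a shift vector.
    shift = [sum(w * v for w, v in zip(row, visual_feature)) for row in meta_weights]
    # Condition every learnable context token on the image by adding the shift.
    conditioned = [[c + s for c, s in zip(tok, shift)] for tok in context_tokens]
    # The audio-embedded token [V_A] sits where the class word would go.
    return conditioned + [list(audio_token)]


rng = random.Random(0)
dim, n_ctx, vis_dim = 4, 3, 5  # toy sizes, chosen only for the example

# In training these would be learnable parameters; here they are random stand-ins.
context_tokens = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(n_ctx)]
meta_weights = [[rng.gauss(0, 0.02) for _ in range(vis_dim)] for _ in range(dim)]

visual_feature = [rng.random() for _ in range(vis_dim)]  # stand-in image feature
audio_token = [rng.random() for _ in range(dim)]         # stand-in for [V_A]

prompt = build_prompt(context_tokens, visual_feature, meta_weights, audio_token)
print(len(prompt), len(prompt[0]))  # n_ctx + 1 tokens, each of size dim
```

In this sketch the fixed words "a photo of a" are replaced by `n_ctx` trainable vectors, so the text-side context can adapt to each image rather than staying constant across inputs.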

Khanh Binh Nguyen, Chae Jung Park • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Sound Source Localization | Flickr SoundNet (test) | CIoU | 84.8 | 49
Audio-Visual Segmentation | AVSBench MS3 (test) | Jaccard Index (IoU) | 40.19 | 38
Sound Source Localization | VGG-SS (test) | CIoU | 54.76 | 30
Audio-Visual Segmentation | AVSBench S4 (test) | F | 74 | 21
Visual Sound Source Localization | VGG-SS extended (test) | Localization Accuracy | 54.76 | 20
Visual Sound Source Localization | Flickr-SoundNet extended (test) | LocAcc | 84.8 | 20
Audio-visual localization | VGGSound (Heard 110 categories) | cIoU | 54.85 | 11
Audio-visual localization | VGGSound (Unheard 110 categories) | cIoU | 48.4 | 11
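
For readers unfamiliar with the Jaccard Index (IoU) reported for segmentation, it is the ratio of overlap to union between a predicted mask and a ground-truth mask. A minimal sketch on binary masks follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, and benchmark toolkits may differ in thresholding and averaging.

```python
def jaccard_index(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0  # pixel set in both masks
            union += 1 if (p or t) else 0   # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(jaccard_index(pred, target))  # 2 shared pixels / 4 in the union = 0.5
```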