SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts

About

Large-scale pre-trained image-text models learn robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Simply replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) fails to capture semantic cues, and the fixed prompt "a photo of a [V_A]" does not establish meaningful connections between the audio embedding and the context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, establishing semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
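
The idea of swapping the fixed prompt for learnable, visually conditioned context tokens can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`init_context`, `condition_context`, `build_prompt`), the additive conditioning, and the token sizes are all assumptions chosen for illustration, and plain Python lists stand in for trainable tensors.

```python
import random

def init_context(num_tokens, dim, seed=0):
    # Hypothetical stand-in for learnable context tokens: small random vectors
    # that a real model would update by gradient descent.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]

def condition_context(context, visual_feature):
    # Assumed conditioning scheme: shift every context token by the visual
    # feature. A real model would likely pass the feature through a small
    # network first; the identity mapping is used here to keep the sketch short.
    return [[c + v for c, v in zip(token, visual_feature)] for token in context]

def build_prompt(context, audio_token):
    # Replace the fixed words "a photo of a" with the context tokens and
    # append the audio-embedded token [V_A] at the end.
    return [list(token) for token in context] + [list(audio_token)]

context = init_context(num_tokens=4, dim=8)
visual_feature = [0.1] * 8   # placeholder image feature
audio_token = [0.5] * 8      # placeholder [V_A] embedding
prompt = build_prompt(condition_context(context, visual_feature), audio_token)
print(len(prompt), len(prompt[0]))
```

The resulting sequence has one row per context token plus one for [V_A]; in the setting the abstract describes, a sequence of this kind would serve as the conditional context passed to the mask decoder.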

Khanh Binh Nguyen, Chae Jung Park • 2026

Related benchmarks

Task	Dataset	Result
Sound Source Localization	Flickr SoundNet (test)	CIoU 84.8	49
Audio-Visual Segmentation	AVSBench MS3 (test)	Jaccard Index (IoU) 40.19	38
Sound Source Localization	VGG-SS (test)	CIoU 54.76	30
Audio-Visual Segmentation	AVSBench S4 (test)	F 74	21
Visual Sound Source Localization	VGG-SS extended (test)	Localization Accuracy 54.76	20
Visual Sound Source Localization	Flickr-SoundNet extended (test)	LocAcc 84.8	20
Audio-visual localization	VGGSound (Heard 110 categories)	cIoU 54.85	11
Audio-visual localization	VGGSound (Unheard 110 categories)	cIoU 48.4	11
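
For readers unfamiliar with the Jaccard Index (IoU) listed for the segmentation rows, here is a minimal sketch of how it is computed for a pair of binary masks. The function name `jaccard_index` and the convention of returning 1.0 when both masks are empty are assumptions; benchmark toolkits may handle that case differently and typically average the score over a dataset.

```python
def jaccard_index(pred, target):
    # pred and target are same-shaped 2D lists of 0/1 values (binary masks).
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = jaccard_index(pred, target)
print(score)  # 2 overlapping pixels out of 4 in the union -> 0.5
```

The function returns a value between 0 and 1, where 1 means the predicted and ground-truth masks coincide exactly.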

