SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes
About
Reference Audio-Visual Segmentation (Ref-AVS) aims to provide pixel-wise scene understanding in Language-aided Audio-Visual Scenes (LAVS). The task requires a model to continuously segment objects referred to by text and audio throughout a video. Previous dual-modality methods often fail because they lack the third modality, while the existing triple-modality method struggles with spatio-temporal consistency, causing the segmentation target to shift across frames. In this work, we introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token to prompt and align SAM2 for Ref-AVS in LAVS. Technically, our approach includes a multimodal fusion module aimed at improving the multimodal understanding of SAM2, as well as token propagation and accumulation strategies designed to enhance spatio-temporal consistency without forgetting historical information. Extensive experiments demonstrate that SAM2-LOVE outperforms the SOTA by 8.5% in $\mathcal{J\&F}$ on the Ref-AVS benchmark and showcase the simplicity and effectiveness of its components. Our code will be available here.
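As a rough illustration of the idea (not the released implementation), the sketch below shows one way the core mechanism could look: a learnable token attends over concatenated text, audio, and visual features, and is then propagated frame by frame with a running accumulation so earlier frames are not forgotten. The names `MultimodalTokenFusion`, `propagate_and_accumulate`, and the momentum-style blending are assumptions for illustration; the actual SAM2 prompt interface is not reproduced here.

```python
import torch
import torch.nn as nn


class MultimodalTokenFusion(nn.Module):
    """Hypothetical sketch: fuse text, audio, and visual features into a single
    learnable token intended to serve as a prompt embedding for SAM2."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.prompt_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable token
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feat, audio_feat, visual_feat):
        # text_feat: (B, Lt, D), audio_feat: (B, La, D), visual_feat: (B, Lv, D)
        context = torch.cat([text_feat, audio_feat, visual_feat], dim=1)
        token = self.prompt_token.expand(context.size(0), -1, -1)
        fused, _ = self.cross_attn(query=token, key=context, value=context)
        return self.norm(token + fused)  # (B, 1, D) prompt token


def propagate_and_accumulate(fusion, visual_feats, audio_feats, text_feat, momentum=0.9):
    """Toy token propagation/accumulation loop: each frame's fused token is
    blended with a running history token (momentum blending is an assumption),
    and the accumulated token would be passed to the SAM2 mask decoder."""
    history = None
    prompts = []
    for visual_feat, audio_feat in zip(visual_feats, audio_feats):
        token = fusion(text_feat, audio_feat, visual_feat)
        history = token if history is None else momentum * history + (1 - momentum) * token
        prompts.append(history)  # one prompt token per frame
    return prompts
```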
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Audio-Visual Segmentation | Ref-AVS 1.0 (seen) | Jaccard Index (J) | 43.5 | 12 |
| Referring Audio-Visual Segmentation | Ref-AVS 1.0 (unseen) | Jaccard Index (J) | 66.5 | 12 |
| Referring Audio-Visual Segmentation | Ref-AVS 1.0 (Mix (S+U)) | Jaccard Index (J) | 55 | 12 |
| Referring Audio-Visual Segmentation | Ref-AVS 1.0 | S-score | 0.23 | 7 |
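For reference, the Jaccard Index (J) reported in the table above is the standard region-similarity measure: per-frame intersection-over-union between predicted and ground-truth masks, averaged over a video. A minimal sketch is shown below; function names are illustrative and not part of the released code.

```python
import numpy as np


def jaccard_index(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Region similarity J for one frame: IoU of binary predicted and GT masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def video_jaccard(pred_masks, gt_masks) -> float:
    """Average J over all annotated frames of a video."""
    return float(np.mean([jaccard_index(p, g) for p, g in zip(pred_masks, gt_masks)]))
```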