
PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

About

Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of each modality varies across referring expressions and scenes. Existing methods, however, typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, which leaves them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, a framework inspired by the biased competition theory in cognitive neuroscience that explicitly models both visual perception and language-driven prior modulation and enables more accurate Ref-AVS through adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior that adaptively guides high-level attention. A Token Distiller then extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss that enhances foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.
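
The idea of a modality prior that suppresses less relevant cues can be illustrated with a minimal sketch. This is not the PRIMED implementation: the function names (`modality_gates`, `suppress`), the three-way logit layout (audio, vision, joint), and the gating rule are all assumptions made here for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modality_gates(prior_logits):
    """Map logits over (audio, vision, joint) to per-modality gates in [0, 1].

    Hypothetical rule: probability mass on "joint" keeps both modalities
    active, so each gate is its own mass plus the joint mass.
    """
    p_audio, p_vision, p_joint = softmax(prior_logits)
    return {"audio": p_audio + p_joint, "vision": p_vision + p_joint}


def suppress(features, gate):
    """Scale a feature vector by its modality gate (soft suppression)."""
    return [gate * x for x in features]


# Hypothetical prior for an expression that leans mostly on audio.
gates = modality_gates([4.0, 0.0, 0.0])
audio_feat = suppress([1.0, 2.0], gates["audio"])
visual_feat = suppress([1.0, 2.0], gates["vision"])
```

With these assumed logits the audio gate stays close to 1 while the vision gate is close to 0, so visual features contribute little to any later fusion step. A learned decoder would produce the logits from the referring expression rather than take them as constants.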

Yuchen He, Jing Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Referring Audio-Visual Segmentation | Ref-AVS | Seen Score | 1.5 | 30 |
| Referential Audio-Visual Segmentation | Ref-AVS (seen) | J & F Score | 0.688 | 28 |
| Referring Audio-Visual Segmentation | Ref-AVS (mix) | Jaccard Index (J) | 68.9 | 28 |
| Referring Audio-Visual Segmentation | Ref-AVS (unseen) | Jaccard Index (J) | 71.8 | 28 |
