PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

About

Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of each modality varies across referring expressions and scenes. Existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, which leaves them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, a method inspired by the biased competition theory in cognitive neuroscience. PRIMED explicitly models both visual perception and language-driven prior modulation, and improves Ref-AVS accuracy through adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, and generates a modality prior that adaptively guides high-level attention. A Token Distiller then extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss that enhances foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.
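The idea of a modality prior that suppresses less relevant cues can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`modality_gates`, `suppress`), the three-way logit layout (audio, vision, joint), and the rule that the joint probability mass supports both modalities are all assumptions made purely for illustration. The abstract does not specify how the prior is applied to attention.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modality_gates(prior_logits):
    """Turn logits over (audio, vision, joint) into per-modality gates in [0, 1].

    Assumption (not from the paper): the 'joint' probability mass is
    credited to both modalities, so a joint-reliant expression keeps
    both audio and vision active.
    """
    p_audio, p_vision, p_joint = softmax(prior_logits)
    return {"audio": p_audio + p_joint, "vision": p_vision + p_joint}


def suppress(audio_scores, vision_scores, prior_logits):
    """Scale hypothetical attention scores of each modality by its gate."""
    gates = modality_gates(prior_logits)
    gated_audio = [gates["audio"] * s for s in audio_scores]
    gated_vision = [gates["vision"] * s for s in vision_scores]
    return gated_audio, gated_vision


if __name__ == "__main__":
    # A prior that strongly favors vision drives the audio gate toward zero.
    audio, vision = suppress([1.0, 2.0], [3.0], [-10.0, 10.0, -10.0])
    print(audio, vision)
```

In this sketch, an expression judged to rely on vision alone nearly zeroes out the audio scores, while a uniform prior leaves both modalities partially active. A learned decoder would produce the logits from the referring expression; here they are supplied by hand.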

Yuchen He, Jing Zhang • 2026

Related benchmarks

Task	Dataset	Result
Referring Audio-Visual Segmentation	Ref-AVS	Seen Score: 1.5	30
Referential Audio-Visual Segmentation	Ref-AVS (seen)	J & F Score: 0.688	28
Referring Audio-Visual Segmentation	Ref-AVS (mix)	Jaccard Index (J): 68.9	28
Referring Audio-Visual Segmentation	Ref-AVS (unseen)	Jaccard Index (J): 71.8	28
