Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

About

Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and in deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,104 videos and 61,095 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses a multimodal large language model (MLLM) to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.

Kaining Ying, Henghui Ding, Guangquan Jie, Yu-Gang Jiang • 2025

Related benchmarks

Task                                    | Dataset          | Metric            | Result | Rank
Referring Audio-Visual Segmentation     | Ref-AVS          | Seen Score        | 9.8    | 30
Referential Audio-Visual Segmentation   | Ref-AVS (seen)   | J & F Score       | 0.552  | 28
Referring Audio-Visual Segmentation     | Ref-AVS (mix)    | Jaccard Index (J) | 54.5   | 28
Referring Audio-Visual Segmentation     | Ref-AVS (unseen) | Jaccard Index (J) | 58.3   | 28
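
For readers unfamiliar with the Jaccard Index (J) listed above, the following is a minimal sketch of how it is commonly computed for a pair of binary segmentation masks: intersection over union. The function name `jaccard_index`, the nested-list mask representation, and the empty-mask convention are assumptions made for illustration; they are not taken from the paper or from the Ref-AVS evaluation code. The sketch returns a fraction in [0, 1], which may differ in scale from the scores listed.

```python
from typing import Sequence

# Hypothetical mask type for this sketch: rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def jaccard_index(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel is foreground in both masks
            union += p or t          # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(jaccard_index(pred, target))  # 2 shared pixels / 4 covered pixels = 0.5
```

For video benchmarks, a per-frame value like this is typically averaged over frames and objects, though the exact averaging used for the numbers above is not stated here.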
