
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

About

Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot adequately capture motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module to make static cues and motion cues play their distinct roles, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable 9.2% $\mathcal{J}\&\mathcal{F}$ improvement on the challenging MeViS dataset. Code is available at https://github.com/heshuting555/DsHmp.

Shuting He, Henghui Ding • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score 67.1 | 244 |
| Referring Video Object Segmentation | Ref-DAVIS 2017 (val) | J&F 64.9 | 205 |
| Referring Video Object Segmentation | MeViS (val) | J&F Score 0.464 | 161 |
| Referring Video Object Segmentation | Ref-DAVIS 17 | J&F Score 64.9 | 131 |
| Video segmentation from a sentence | A2D Sentences (test) | Overall IoU 81.1 | 122 |
| Referring Video Segmentation | Ref-YouTube-VOS | J&F Score 67.1 | 108 |
| Referring Video Object Segmentation | Ref-YouTube-VOS | J&F 67.1 | 85 |
| Referring Video Segmentation | MeViS | J&F Score 46.4 | 81 |
| Referring Video Object Segmentation | A2D-Sentences | oIoU 81.1 | 57 |
| Referring Video Object Segmentation | YoURVOS (test) | J&F 21 | 40 |
Showing 10 of 22 rows
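
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how they are commonly defined, not the evaluation code used by the paper or by any benchmark server. Region similarity J is the intersection-over-union of a predicted and a ground-truth mask, J&F is the mean of J and the contour accuracy F, and overall IoU (oIoU) divides the summed intersection by the summed union over all samples. The mask representation (nested lists of 0/1), the function names, and the empty-mask convention are assumptions made for illustration. The contour accuracy F requires boundary matching and is taken as a given input here.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical mask representation: rows of 0/1 values with equal shapes.
Mask = Sequence[Sequence[int]]


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Region similarity J: IoU of one predicted mask and its ground truth."""
    inter, union = intersection_union(pred, gt)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def overall_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Overall IoU: summed intersection over summed union across samples."""
    total_inter = 0
    total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return 1.0 if total_union == 0 else total_inter / total_union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0
```

Because overall IoU pools pixels before dividing, large objects weigh more heavily in it than in a per-sample average of J, so the two numbers are not directly comparable.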
