Temporal Collection and Distribution for Referring Video Object Segmentation

About

Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression. It requires aligning the natural language expression with the objects' motions and their dynamic associations at the global video level, while segmenting objects at the frame level. To achieve this goal, we propose to simultaneously maintain a global referent token and a sequence of object queries, where the former is responsible for capturing the video-level referent according to the language expression, while the latter serves to better locate and segment objects within each frame. Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interaction between the global referent token and the object queries. Specifically, the temporal collection mechanism collects global information for the referent token, from the object queries, to the temporal motions, to the language expression. In turn, the temporal distribution first distributes the referent token to the referent sequence across all frames and then performs efficient cross-frame reasoning between the referent sequence and the object queries in every frame. Experimental results show that our method consistently and significantly outperforms state-of-the-art methods on all benchmarks.
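The collection and distribution steps described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes plain single-head dot-product attention with residual updates, it omits the language expression and motion features entirely, and every name in it (`attend`, `temporal_collection`, `temporal_distribution`) is hypothetical. The per-frame specialization of the referent token in the distribution step is also an assumption made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def attend(query, keys, values):
    # Single-head scaled dot-product attention for one query vector.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def temporal_collection(referent, object_queries):
    # Collect (assumed form): the global referent token attends over the
    # object queries of all frames and is updated with a residual connection.
    flat = [q for frame in object_queries for q in frame]
    return add(referent, attend(referent, flat, flat))

def temporal_distribution(referent, object_queries):
    # Distribute (assumed form): build one referent vector per frame from the
    # global token, then let every object query attend to the whole referent
    # sequence, which gives each frame access to cross-frame information.
    referent_seq = [add(referent, attend(referent, frame, frame))
                    for frame in object_queries]
    updated = [[add(q, attend(q, referent_seq, referent_seq)) for q in frame]
               for frame in object_queries]
    return referent_seq, updated

# Toy input: 2 frames, 2 object queries per frame, feature dimension 2.
queries = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
referent = temporal_collection([0.1, 0.2], queries)
referent_seq, queries = temporal_distribution(referent, queries)
```

In this sketch the referent token is the only video-level state, while the object queries stay per-frame, which mirrors the split between global alignment and frame-level segmentation described in the abstract.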

Jiajin Tang, Ge Zheng, Sibei Yang • 2023

Related benchmarks

Task | Dataset | Result | Rank
Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score: 65.8 | 200
Referring Video Object Segmentation | Ref-DAVIS 2017 (val) | J&F: 64.6 | 178
Referring Video Object Segmentation | Ref-DAVIS 17 | J&F Score: 64.6 | 131
Referring Video Segmentation | Ref-YouTube-VOS | J&F Score: 65.8 | 91
Referring Video Object Segmentation | Ref-YouTube-VOS | J&F: 62.3 | 85
Referring Video Object Segmentation | JHMDB Sentences (test) | Overall IoU: 0.706 | 83
Referring Video Object Segmentation | DAVIS RVOS 2017 (val) | J&F Score: 64.6 | 16
Referring Video Object Segmentation | A2D-Sentences (val) | Overall IoU: 76.6 | 11