Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

About

Referring video object segmentation (RVOS) aims to segment video objects under the guidance of a natural language reference. Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues and easily leads to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranked first in the CVPR 2021 Referring YouTube-VOS challenge.
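To make the top-down idea concrete, the second stage can be pictured as choosing one tracklet out of a candidate set given a sentence embedding. The sketch below is only an illustration under strong assumptions: it replaces the Transformer-based grounding module with plain cosine similarity followed by a softmax, and the feature vectors, the function name `ground_tracklet`, and the toy inputs are all hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground_tracklet(tracklet_feats, text_feat):
    """Pick the candidate tracklet that best matches the text feature.

    Assumption: each tracklet has already been pooled into one feature vector.
    Cosine similarity stands in for the learned cross-modal grounding module.
    """
    if not tracklet_feats:
        raise ValueError("need at least one candidate tracklet")
    scores = [cosine(feat, text_feat) for feat in tracklet_feats]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy example: three hypothetical tracklet features and one query feature.
tracklets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.0]]
query = [0.0, 1.0, 0.0]
best, probs = ground_tracklet(tracklets, query)
```

The point of the sketch is the object-level formulation: the language query is matched against whole tracklets rather than against individual pixels of the image lattice.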

Chen Liang, Yu Wu, Tianfei Zhou, Wenguan Wang, Zongxin Yang, Yunchao Wei, Yi Yang • 2021

Related benchmarks

Task	Dataset	Metric	Result
Referring Video Object Segmentation	Ref-YouTube-VOS (val)	J&F Score	61.4	244
Referring Video Object Segmentation	Ref-DAVIS 2017 (val)	J&F	56.4	240
Referring Video Segmentation	Ref-YouTube-VOS	J&F Score	61.4	108
Referring Video Segmentation	Refer-Youtube-VOS (val)	J Index	60	44
Referring Video Object Segmentation	Ref-Youtube-VOS v1.0 (test)	J&F Score	56.4	33
Referring Video Object Segmentation	Refer-Youtube-VOS	J&F Score	61.4	23
Referring Video Object Segmentation	Ref-Youtube-VOS 2019 (test)	J&F Score	61.4	22
Referring Video Object Segmentation	Ref-YouTube-VOS (test)	J&F Score	56.4	18
Referring Video Object Segmentation	YouTube-RVOS (val)	J&F Score	61.4	6
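
For readers unfamiliar with the metrics in the table, J is the region similarity (the Jaccard index, or intersection over union, between predicted and ground-truth masks), F is the contour accuracy, and J&F is the mean of the two. The sketch below is a minimal illustration, assuming binary masks stored as nested lists and an F value computed elsewhere; the function names are hypothetical and this is not the official evaluation code.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks of the same shape.

    Masks are nested lists of 0/1. Assumption: when both masks are empty,
    the score is taken as 1.0, a common convention in evaluation toolkits.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy masks: 2 overlapping pixels out of 4 pixels covered by either mask.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]
j = region_similarity(pred_mask, gt_mask)

# The F value here is a made-up placeholder, only to show the averaging step.
score = j_and_f(j, 0.7)
```

In benchmark reporting these per-frame values are averaged over frames and objects and usually expressed as percentages, which is the scale used in the table above.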

