Temporally Consistent Referring Video Object Segmentation with Hybrid Memory

About

Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.

Bo Miao, Mohammed Bennamoun, Yongsheng Gao, Mubarak Shah, Ajmal Mian • 2024

Related benchmarks

Task	Dataset	Metric	Result
Referring Video Object Segmentation	Ref-YouTube-VOS (val)	J&F Score	67.1	244
Referring Video Object Segmentation	Ref-DAVIS 2017 (val)	J&F	65.6	230
Referring Video Object Segmentation	MeViS (val)	J&F Score	0.427	166
Referring Video Object Segmentation	Ref-DAVIS 17	J&F Score	65.6	131
Referring Video Object Segmentation	Ref-YouTube-VOS	J&F	67.5	103
Referring Video Object Segmentation	A2D-Sentences	oIoU	80.1	61
Referring Video Object Segmentation	JHMDB Sentences	Overall IoU	73.9	56
Referring Video Object Segmentation	YoURVOS (test)	J&F	26.1	40

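Most rows above report J&F, the mean of region similarity J (per-frame mask IoU) and contour accuracy F. The sketch below covers only the J half, and adds a simple threshold-based consistency check to show why a per-frame average can hide temporal flicker. The consistency function is a hypothetical illustration, not the paper's MCS definition, which the abstract does not spell out; the names `mask_iou`, `mean_j`, and `consistency_at` are likewise assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]   # binary mask: rows of 0/1 values
Video = Sequence[Mask]           # one mask per frame


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Region similarity J for a single frame: |pred AND gt| / |pred OR gt|."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def per_frame_iou(pred_video: Video, gt_video: Video) -> List[float]:
    """IoU of each predicted frame against its ground-truth frame."""
    return [mask_iou(p, g) for p, g in zip(pred_video, gt_video)]


def mean_j(pred_video: Video, gt_video: Video) -> float:
    """Average J over the frames of one video."""
    ious = per_frame_iou(pred_video, gt_video)
    return sum(ious) / len(ious)


def consistency_at(videos: Sequence[Tuple[Video, Video]], threshold: float) -> float:
    """Hypothetical consistency score (an assumption, not the paper's MCS):
    the fraction of (prediction, ground truth) pairs whose worst frame
    still has IoU above `threshold`."""
    passed = sum(
        1 for pred, gt in videos if min(per_frame_iou(pred, gt)) > threshold
    )
    return passed / len(videos)


# Toy data: a one-row, three-pixel object tracked over three frames.
full = [[1, 1, 1]]
partial = [[1, 1, 0]]
empty = [[0, 0, 0]]

gt_video = [full, full, full]
steady = [partial, partial, partial]   # slightly off in every frame
flicker = [full, full, empty]          # perfect, then loses the object

print(mean_j(steady, gt_video), mean_j(flicker, gt_video))
print(consistency_at([(steady, gt_video), (flicker, gt_video)], 0.5))
```

In this toy case both predictions have the same mean J, yet only the steady one keeps every frame above the threshold, which is the kind of gap a dedicated temporal consistency metric is meant to expose.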

