ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

About

Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9% J&F on Ref-YouTube-VOS) with real-time inference speed (51 FPS).
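
The abstract does not spell out how confidence-aware query pruning works, so the following is only a minimal sketch of the general idea: drop low-confidence object queries so that later decoding steps process fewer of them. The helper name `prune_queries`, the `keep_ratio` parameter, and the top-k selection rule are all assumptions made for illustration and are not taken from ReferDINO.

```python
def prune_queries(queries, confidences, keep_ratio=0.5):
    """Keep the highest-confidence queries, preserving their original order.

    Hypothetical helper: `keep_ratio` and the top-k rule are illustrative
    assumptions, not the pruning rule used by ReferDINO.
    """
    if len(queries) != len(confidences):
        raise ValueError("queries and confidences must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Always keep at least one query when any are present.
    keep = max(1, int(len(queries) * keep_ratio))

    # Indices of the `keep` most confident queries.
    ranked = sorted(range(len(queries)), key=lambda i: confidences[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original ordering

    return [queries[i] for i in kept], [confidences[i] for i in kept]


if __name__ == "__main__":
    qs = ["q0", "q1", "q2", "q3"]
    scores = [0.1, 0.9, 0.4, 0.7]
    print(prune_queries(qs, scores, keep_ratio=0.5))
```

In a multi-layer decoder, a step like this could be applied between layers so that each later layer attends over a smaller query set. Whether ReferDINO prunes this way is not stated here.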

Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, Jian-Fang Hu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Video Object Segmentation | Ref-DAVIS 17 | J&F Score 68.9 | 131 |
| Referring Video Segmentation | Ref-YouTube-VOS | J&F Score 69.3 | 91 |
| Referring Video Object Segmentation | A2D-Sentences | oIoU 82.1 | 57 |
| Referring Video Segmentation | MeViS | J&F Score 49.3 | 50 |
| Referring Video Object Segmentation | JHMDB Sentences | mAP 46.6 | 29 |
| Video Object Segmentation | OK-VOS | One-hop J&F 25.1 | 13 |
| Referring Video Object Segmentation | Long-RVOS (val) | Static J&F 52.5 | 8 |
| Referring Video Object Segmentation | Long-RVOS (test) | Static J&F 50.9 | 8 |
| Referring Video Object Segmentation | RS-RVOS Bench 1.0 (test) | J&F 56.2 | 8 |
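
For readers unfamiliar with the J&F numbers reported in the benchmark rows above: J is region similarity (intersection over union between predicted and ground-truth masks), F is a boundary F-measure, and J&F is their mean. The sketch below illustrates that arithmetic on toy inputs. It assumes boundary precision and recall have already been computed, since the boundary matching step is omitted, and it treats two empty masks as a perfect match, which is a convention assumed here. The function names are illustrative and do not come from any benchmark toolkit.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def contour_accuracy(precision, recall):
    """F: harmonic mean of boundary precision and recall (both precomputed)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [1, 0]]
    j = region_similarity(pred, gt)   # 1 overlapping pixel out of 3 in the union
    f = contour_accuracy(0.5, 0.5)    # toy boundary precision/recall values
    print(j, f, j_and_f(j, f))
```

Benchmark scores are typically averaged over all frames and objects and reported as percentages, so a table entry such as 68.9 corresponds to a mean J&F of 0.689 on this scale.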
