ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
About
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that ReferDINO significantly outperforms previous methods (e.g., +3.9% J&F on Ref-YouTube-VOS) with real-time inference speed (51 FPS).
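The confidence-aware query pruning mentioned above can be pictured with a small sketch. The abstract only states that queries are pruned by confidence to speed up object decoding; how the confidence is computed, the `keep_ratio` parameter, and the function name `prune_queries` are all assumptions made here for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def prune_queries(
    queries: Sequence[Sequence[float]],
    confidences: Sequence[float],
    keep_ratio: float = 0.5,  # hypothetical knob, not taken from the paper
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep the most confident object queries, preserving their original order.

    `confidences` holds one score per query; how such scores are obtained
    is outside the scope of this sketch.
    """
    if len(queries) != len(confidences):
        raise ValueError("queries and confidences must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Always keep at least one query so later layers have something to decode.
    num_keep = max(1, math.ceil(len(queries) * keep_ratio))

    # Rank query indices from most to least confident.
    ranked = sorted(range(len(queries)), key=lambda i: confidences[i], reverse=True)

    # Restore the original ordering among the survivors.
    kept = sorted(ranked[:num_keep])
    return [queries[i] for i in kept], kept


if __name__ == "__main__":
    feats = [[0.1], [0.2], [0.3], [0.4]]
    scores = [0.9, 0.1, 0.8, 0.3]
    kept_feats, kept_idx = prune_queries(feats, scores, keep_ratio=0.5)
    print(kept_idx, kept_feats)
```

Applied once per decoder layer, a step like this shrinks the set of queries that later layers must process, which is where the claimed speed-up would come from under these assumptions.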
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Referring Video Object Segmentation | Ref-DAVIS 17 | J&F Score | 68.9 | 131 |
| Referring Video Segmentation | Ref-YouTube-VOS | J&F Score | 69.3 | 91 |
| Referring Video Object Segmentation | A2D-Sentences | oIoU | 82.1 | 57 |
| Referring Video Segmentation | MeViS | J&F Score | 49.3 | 50 |
| Referring Video Object Segmentation | JHMDB Sentences | mAP | 46.6 | 29 |
| Video Object Segmentation | OK-VOS | One-hop J&F | 25.1 | 13 |
| Referring Video Object Segmentation | Long-RVOS (val) | Static J&F | 52.5 | 8 |
| Referring Video Object Segmentation | Long-RVOS (test) | Static J&F | 50.9 | 8 |
| Referring Video Object Segmentation | RS-RVOS Bench 1.0 (test) | J&F | 56.2 | 8 |
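
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how two of them are usually defined: J is region similarity (intersection over union of the predicted and ground-truth masks), J&F is the mean of J and the contour accuracy F, and oIoU is the total intersection divided by the total union over all samples. The helper names, the nested-list mask format, and the convention of returning 1.0 for two empty masks are assumptions for this example. The boundary measure F is taken as a given input rather than computed, and the official evaluation toolkits may differ in detail.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def _intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection over union for a single mask pair."""
    inter, union = _intersection_union(pred, gt)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def overall_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """oIoU: total intersection over total union, accumulated over all pairs."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = _intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return 1.0 if total_union == 0 else total_inter / total_union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


if __name__ == "__main__":
    small = ([[1, 1], [0, 0]], [[1, 0], [0, 0]])  # J = 1/2
    large = ([[1, 1], [1, 1]], [[1, 1], [1, 1]])  # J = 4/4
    print(region_similarity(*small), overall_iou([small, large]))
```

In this toy case the per-sample J values average to 0.75 while oIoU is 5/6, which shows why oIoU favors large objects: every pixel, not every sample, carries equal weight.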