SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
About
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code will be available.
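To make the idea of video-level contrastive supervision concrete, here is a minimal sketch, not the authors' implementation. It assumes frame-level object embeddings are mean-pooled into a single video-level vector and that a symmetric InfoNCE loss stands in for the paper's multi-modal contrastive supervision. The names `mean_pool`, `cosine`, `info_nce`, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def mean_pool(frames):
    """Average equal-length frame-level embeddings into one video-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE: the i-th video and i-th text form the positive pair.

    This loss is an assumption used for illustration; the abstract does not
    specify the exact form of the contrastive objective.
    """
    n = len(video_embs)
    sim = [[cosine(v, t) / temperature for t in text_embs] for v in video_embs]

    def row_loss(logits, target):
        # Numerically stable cross-entropy for a single row of logits.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    video_to_text = sum(row_loss(sim[i], i) for i in range(n)) / n
    text_to_video = sum(
        row_loss([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)


# Toy data: two videos, three frames each, 2-D object embeddings.
frames_a = [[1.0, 0.1], [0.9, 0.0], [1.1, -0.1]]
frames_b = [[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]]
videos = [mean_pool(frames_a), mean_pool(frames_b)]
texts = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = info_nce(videos, texts)
swapped_loss = info_nce(videos, texts[::-1])
```

In this toy setup the loss is small when each pooled video embedding points in the same direction as its own sentence embedding, and large when the sentences are swapped, which is the behaviour a video-level alignment objective is meant to reward.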
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score | 67.3 | 200 |
| Referring Video Object Segmentation | Ref-DAVIS 2017 (val) | J&F | 65.8 | 178 |
| Referring Video Object Segmentation | Ref-DAVIS 17 | J&F Score | 64.2 | 131 |
| Video segmentation from a sentence | A2D Sentences (test) | Overall IoU | 80.7 | 122 |
| Referring Video Segmentation | Ref-YouTube-VOS | J&F Score | 67.3 | 91 |
| Referring Video Object Segmentation | Ref-YouTube-VOS | J&F | 66 | 85 |
| Referring Video Object Segmentation | JHMDB Sentences (test) | Overall IoU | 0.736 | 83 |
| Referring Video Object Segmentation | A2D-Sentences | oIoU | 80.7 | 57 |
| Referring Video Object Segmentation | JHMDB Sentences | Overall IoU | 73.6 | 56 |
| Referring Video Segmentation | JHMDB Sentences (test) | mAP (0.5:0.95) | 44.6 | 35 |