Scalable Video Object Segmentation with Simplified Framework

About

Currently popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that perform feature extraction and matching separately. However, these hand-crafted designs empirically cause insufficient target interaction, which limits dynamic target-aware feature learning in VOS. To tackle these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework that performs joint feature extraction and matching with a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features. This design enables SimVOS to learn better target-aware features for accurate mask prediction. More importantly, SimVOS can directly apply well-pretrained ViT backbones (e.g., MAE) to VOS, which bridges the gap between VOS and large-scale self-supervised pre-training. To achieve a better performance-speed trade-off, we further explore within-frame attention and propose a new token refinement module to improve the running speed and save computational cost. Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks, i.e., DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9% J&F) and YouTube-VOS 2019 (84.2% J&F), without applying any synthetic video or BL30K pre-training used in previous VOS approaches.
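The joint extraction-and-matching idea described above can be illustrated with a minimal sketch. This is not the SimVOS implementation: it assumes a single attention head with identity projections and no learned weights, and the names `self_attention` and `joint_block` are hypothetical. It only shows that once reference and query tokens are concatenated into one sequence, a single self-attention step lets every query token attend to the reference tokens, so matching happens inside the same operation that extracts features.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (illustrative assumption)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        # Scaled dot-product score of this token against every token in the sequence.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all tokens.
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def joint_block(reference_tokens, query_tokens):
    """Hypothetical helper: attend over reference and query tokens as one sequence."""
    joint = reference_tokens + query_tokens
    attended = self_attention(joint)
    n_ref = len(reference_tokens)
    return attended[:n_ref], attended[n_ref:]


random.seed(0)
dim = 4
reference_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
query_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
ref_out, query_out = joint_block(reference_tokens, query_tokens)
print(len(ref_out), len(query_out), len(query_out[0]))  # 3 5 4
```

In a real ViT backbone the projections are learned and many such blocks are stacked; the sketch only conveys why a separate hand-crafted matching module is not needed in this setup.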

Qiangqiang Wu, Tianyu Yang, Wei Wu, Antoni Chan • 2023

Related benchmarks

Task                                       Dataset                        Result                    Rank
Video Object Segmentation                  DAVIS 2017 (val)               --                        1130
Video Object Segmentation                  YouTube-VOS 2019 (val)         --                        231
Video Object Segmentation                  SA-V (val)                     J&F Score 44.2            74
Video Object Segmentation                  SA-V (test)                    J&F 44.1                  70
Semi-supervised Video Object Segmentation  DAVIS 2017 (val)               J&F Score 88              31
Video Object Segmentation                  Hardware Efficiency Benchmark  FPS 3.3                   21
Semi-supervised Video Object Segmentation  DAVIS 17 (test-dev)            J&F Score 80.4            17
Semi-supervised Video Object Segmentation  YTVOS 2019 (val)               Overall Jaccard (G) 84.2  17
Semi-supervised Video Object Segmentation  SA-V (val)                     J&F Score 44.2            15
Semi-supervised Video Object Segmentation  SA-V (test)                    J&F Score 44.1            15
