VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

About

Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. Our key insight is that using high-quality pseudo masks and a simple video synthesis method for model training is surprisingly sufficient to enable the resulting video model to effectively segment and track multiple instances across video frames. We show the first competitive unsupervised learning results on the challenging YouTubeVIS-2019 benchmark, achieving 50.7% AP^video_50, surpassing the previous state-of-the-art by a large margin. VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks, exceeding DINO by 15.9% on YouTubeVIS-2019 in terms of AP^video.

Xudong Wang, Ishan Misra, Ziyun Zeng, Rohit Girdhar, Trevor Darrell • 2023

Related benchmarks

Task	Dataset	Metric	Value
Video Instance Segmentation	YouTube-VIS 2019 (val)	AP	24.5	620
Video Instance Segmentation	YouTube-VIS 2021 (val)	AP	32.4	372
Video Instance Segmentation	YouTube-VIS 2019	AP	24.5	123
Video Instance Segmentation	YouTube-VIS 2021	AP	18	99
Video segmentation	DAVIS	--	--	53
Video Instance Segmentation	OVIS	--	--	23
Video Semantic Segmentation	YouTube-VIS 2021	mAP	17.1	7
Video Instance Segmentation	DAVIS All	J&F Score	44.9	4
Video Instance Segmentation	UVO-Dense (val)	AP @ IoU=0.50	13.5	3
Video Instance Segmentation	YouTube-VIS 2022	AP50	31.7	2

Showing 10 of 11 rows
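
The AP50-style metrics above count a predicted instance track as correct when its IoU with a ground-truth track reaches 0.5. In the YouTube-VIS protocol, that IoU is taken over the whole spatio-temporal volume (intersections and unions summed across all frames) rather than frame by frame. The sketch below illustrates only this matching criterion, assuming binary masks stored as nested lists of 0/1. The function names are hypothetical, and this is not the benchmark's official evaluation code, which additionally ranks predictions by confidence and averages precision over recall levels.

```python
from typing import Sequence

# One binary mask per frame: a list of rows, each row a list of 0/1 values.
Mask = Sequence[Sequence[int]]


def video_mask_iou(pred: Sequence[Mask], gt: Sequence[Mask]) -> float:
    """Spatio-temporal IoU between two mask tracks of equal length.

    Intersection and union are accumulated over every frame before
    dividing, so a frame where the prediction is missing still adds
    to the union and lowers the score.
    """
    if len(pred) != len(gt):
        raise ValueError("tracks must cover the same number of frames")
    intersection = 0
    union = 0
    for pred_frame, gt_frame in zip(pred, gt):
        for pred_row, gt_row in zip(pred_frame, gt_frame):
            for p, g in zip(pred_row, gt_row):
                intersection += 1 if (p and g) else 0
                union += 1 if (p or g) else 0
    # Two empty tracks have no overlap to measure; treat as IoU 0 (assumption).
    return intersection / union if union else 0.0


def is_match_at_50(pred: Sequence[Mask], gt: Sequence[Mask]) -> bool:
    """True when the track pair clears the IoU = 0.5 threshold."""
    return video_mask_iou(pred, gt) >= 0.5


# Toy two-frame example on a 2x2 grid: the prediction overlaps the
# ground truth in frame 1 and misses the object entirely in frame 2.
pred_track = [[[1, 1], [0, 0]], [[0, 0], [0, 0]]]
gt_track = [[[1, 0], [0, 0]], [[1, 1], [0, 0]]]

iou = video_mask_iou(pred_track, gt_track)  # 1 shared pixel / 4 in the union
matched = is_match_at_50(pred_track, gt_track)
```

In the toy example the track overlaps well in the first frame but is lost in the second, so the pooled IoU falls below the threshold. This is why the metric rewards consistent tracking and not just good per-frame masks.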
