
Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective

About

Learning a good representation for space-time correspondence is key to various computer vision tasks, including tracking object bounding boxes and performing video object pixel segmentation. To learn a generalizable representation for correspondence at large scale, a variety of self-supervised pretext tasks have been proposed that explicitly perform object-level or patch-level similarity learning. Instead of following the previous literature, we propose to learn correspondence using Video Frame-level Similarity (VFS) learning, i.e., simply learning from comparing video frames. Our work is inspired by the recent success of image-level contrastive learning and similarity learning for visual recognition. Our hypothesis is that if the representation is good for recognition, it requires the convolutional features to find correspondence between similar objects or parts. Our experiments show the surprising result that VFS surpasses state-of-the-art self-supervised approaches on both OTB visual object tracking and DAVIS video object segmentation. We perform a detailed analysis of what matters in VFS and reveal new properties of image- and frame-level similarity learning. The project page with code is available at https://jerryxu.net/VFS
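The core idea of frame-level similarity learning — treating two frames sampled from the same video as a positive pair and frames from other videos as negatives — can be sketched with an InfoNCE-style loss over frame embeddings. This is a minimal, hypothetical NumPy sketch, not the paper's actual implementation; function names, shapes, and the temperature value are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize each row to unit length (cosine-similarity space)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def frame_similarity_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss over a batch of frame-pair embeddings.

    z1, z2: (batch, dim) embeddings of two frames sampled from the
    same videos. Matching rows (the diagonal of the similarity
    matrix) are positives; all other rows act as negatives.
    """
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    logits = z1 @ z2.T / temperature              # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # cross-entropy on positives
```

In practice the embeddings would come from a convolutional encoder applied to augmented frames; the loss pulls frames of the same video together and pushes frames of different videos apart, which is the frame-level comparison the abstract describes.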

Jiarui Xu, Xiaolong Wang · 2021

Related benchmarks

Task                        Dataset            Metric     Result   Rank
Semantic Segmentation       ADE20K (val)       mIoU       31.4     2731
Video Object Segmentation   DAVIS 2017 (val)   J mean     66.5     1130
Semantic Segmentation       ADE20K             mIoU       31.4     936
Object Detection            COCO (val)         mAP        41.6     613
Object Detection            LVIS (val)         mAP        23.2     141
Visual Object Tracking      OTB-100            AUC        52.5     136
Object Detection            COCO               mAP        41.6     107
Pose Propagation            JHMDB              PCK@0.1    60.9     20
Video Label Propagation     JHMDB (val)        PCK@0.1    60.9     17
Human Pose Tracking         JHMDB (val)        PCK@0.1    60.5     15
Showing 10 of 23 rows
