Learning Correspondence from the Cycle-Consistency of Time

About

We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. At training time, our model learns a feature map representation that is useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation, without finetuning, across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods.
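The two ideas in the abstract, tracking around a cycle in time and matching by nearest neighbors at test time, can be illustrated with a toy NumPy sketch. It is not the authors' implementation: it assumes per-frame features are already given as (N, D) arrays and uses a hard argmax over cosine similarity in place of a learned, differentiable tracker. The function names `nearest_neighbor`, `cycle_track`, and `propagate_labels` are hypothetical.

```python
import numpy as np

def nearest_neighbor(query, feats):
    """Index of the row of `feats` with the highest cosine similarity to `query`."""
    q = query / np.linalg.norm(query)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return int(np.argmax(f @ q))

def cycle_track(frames, start):
    """Track location `start` forward through `frames`, then backward to frame 0.

    `frames` is a list of (N, D) arrays: N spatial locations, D feature dims.
    Returns the index in frame 0 where the backward pass lands. The cycle is
    consistent when this equals `start`.
    """
    idx = start
    # Forward pass: frame 0 -> 1 -> ... -> T
    for prev, nxt in zip(frames[:-1], frames[1:]):
        idx = nearest_neighbor(prev[idx], nxt)
    # Backward pass: frame T -> ... -> 0
    rev = frames[::-1]
    for prev, nxt in zip(rev[:-1], rev[1:]):
        idx = nearest_neighbor(prev[idx], nxt)
    return idx

def propagate_labels(src_feats, src_labels, dst_feats):
    """Give each destination location the label of its nearest source location."""
    s = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    d = dst_feats / np.linalg.norm(dst_feats, axis=1, keepdims=True)
    return src_labels[np.argmax(d @ s.T, axis=1)]

if __name__ == "__main__":
    # Assumed toy data: the same 6 feature vectors, shuffled in each later frame.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(6, 8))
    frames = [base, base[rng.permutation(6)], base[rng.permutation(6)]]
    print("cycle end point for start=2:", cycle_track(frames, 2))
```

In this sketch a mismatch between the start and end of the cycle plays the role of the training signal, and `propagate_labels` mirrors the test-time use, where masks or keypoints in one frame are carried to another by feature similarity.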

Xiaolong Wang, Allan Jabri, Alexei A. Efros • 2019

Related benchmarks

Task	Dataset	Result
Video Object Segmentation	DAVIS 2017 (val)	J mean 46.4	1226
Semantic segmentation	VOC 2012 (val)	mIoU 52.8	76
Pose Propagation	JHMDB	PCK@0.1 57.7	42
One-shot Video Object Segmentation	DAVIS 2016 (val)	J Mean 55.8	28
Video label propagation	JHMDB (val)	PCK@0.1 57.3	17
Human Pose Tracking	JHMDB (val)	PCK@.1 57.3	15
Instance Segmentation Propagation	DAVIS 2017	J Mean 46.4	14
Human Part Propagation	VIP (val)	mIoU 28.9	12
Human Pose Tracking	JHMDB (split1)	PCK @ 0.1 57.3	11
One-shot Video Object Segmentation	DAVIS 2017 (val)	J&F Mean 42.8	11

Showing 10 of 18 rows
