Joint-task Self-supervised Learning for Temporal Correspondence

About

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
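As a rough illustration of what an inter-frame affinity matrix does, the sketch below computes a softmax-normalized similarity between per-pixel features of two frames and uses it to carry labels from one frame to the next. This is a generic NumPy sketch under stated assumptions, not the authors' implementation: the function names `affinity` and `propagate`, the dot-product similarity, and the `temperature` parameter are all illustrative choices that the abstract does not specify.

```python
import numpy as np

def affinity(feat_a, feat_b, temperature=1.0):
    """Hypothetical helper: inter-frame affinity between two sets of features.

    feat_a: (N_a, C) features of the source frame (one row per pixel).
    feat_b: (N_b, C) features of the target frame.
    Returns an (N_a, N_b) matrix whose columns each sum to 1, so every
    target pixel holds a distribution over source pixels.
    """
    logits = feat_a @ feat_b.T / temperature
    # Subtract the column-wise maximum for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)

def propagate(labels_a, aff):
    """Hypothetical helper: carry per-pixel labels to the target frame.

    labels_a: (N_a, K) soft or one-hot labels in the source frame.
    aff:      (N_a, N_b) affinity from `affinity`.
    Returns (N_b, K): each target pixel gets a weighted mix of source labels.
    """
    return aff.T @ labels_a

# Toy usage: the target frame is a shuffled copy of the source frame.
feat_a = np.eye(6)                       # six "pixels" with distinct features
perm = np.array([3, 0, 5, 1, 4, 2])      # assumed motion: a permutation
feat_b = feat_a[perm]
labels_a = np.eye(3)[[0, 0, 1, 1, 2, 2]] # three segments, one-hot
aff = affinity(feat_a, feat_b, temperature=0.1)
labels_b = propagate(labels_a, aff)
```

The same matrix can in principle serve coarse and fine matching, since summing or pooling its entries over a region gives a region-level transition; how the paper actually couples the two levels is not detailed in the abstract.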

Xueting Li, Sifei Liu, Shalini De Mello, Xiaolong Wang, Jan Kautz, Ming-Hsuan Yang • 2019

Related benchmarks

Task	Dataset	Metric	Result	
Video Object Segmentation	DAVIS 2017 (val)	J mean	59.3	1251
Video Object Segmentation	DAVIS 2017 (test)	J (Jaccard Index)	66.2	107
Video Object Segmentation	DAVIS 2017	Jaccard Index (J)	66.2	82
Visual Object Tracking	OTB 2015 (test)	AUC Score	59.2	47
Pose Propagation	JHMDB	PCK@0.1	58.6	42
Video label propagation	JHMDB (val)	PCK@0.1	58.6	17
Human Pose Tracking	JHMDB (val)	PCK@0.1	58.6	15
Instance Segmentation Propagation	DAVIS 2017	J Mean	57.7	14
Segment Propagation	DAVIS	J&Fm Score	59.5	12
Human Part Propagation	VIP (val)	mIoU	34.1	12

Showing 10 of 14 rows
