
Joint-task Self-supervised Learning for Temporal Correspondence

About

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
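To make the idea of an inter-frame affinity matrix more concrete, here is a minimal NumPy sketch of how such a matrix can be built from per-pixel features and used to carry labels from one frame to the next. It is a generic illustration, not the authors' implementation: the function names `affinity_matrix` and `propagate`, the `temperature` parameter, and the dot-product-plus-softmax formulation are all assumptions made for this example.

```python
import numpy as np


def affinity_matrix(feat_a, feat_b, temperature=1.0):
    """Softmax-normalised affinity between the pixels of two frames.

    feat_a: (N_a, C) array, one C-dimensional feature per pixel of frame A.
    feat_b: (N_b, C) array, one C-dimensional feature per pixel of frame B.
    Returns an (N_a, N_b) array whose columns each sum to 1; column j
    describes how pixel j of frame B is composed from pixels of frame A.
    (Hypothetical helper; the formulation is an assumption.)
    """
    logits = feat_a @ feat_b.T / temperature
    # Subtract the per-column maximum for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


def propagate(labels_a, affinity):
    """Carry per-pixel labels from frame A to frame B.

    labels_a: (N_a, K) array, e.g. one-hot segmentation labels.
    affinity: (N_a, N_b) array from affinity_matrix.
    Returns an (N_b, K) array of soft labels for frame B.
    """
    return affinity.T @ labels_a


if __name__ == "__main__":
    # Toy frames: frame B is a shuffled copy of frame A's four pixels.
    feat_a = np.eye(4)
    perm = [2, 0, 3, 1]
    feat_b = feat_a[perm]
    labels_a = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)

    A = affinity_matrix(feat_a, feat_b, temperature=0.01)
    labels_b = propagate(labels_a, A)
    print(labels_b.argmax(axis=1))
```

In this toy setting, a low temperature makes each column of the affinity close to one-hot, so the labels simply follow the shuffled pixels. The same matrix can be applied to any per-pixel quantity (segmentation masks, keypoint heatmaps, coordinates), which is why a single affinity can serve several propagation tasks.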

Xueting Li, Sifei Liu, Shalini De Mello, Xiaolong Wang, Jan Kautz, Ming-Hsuan Yang • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 59.3 | 1130 |
| Video Object Segmentation | DAVIS 2017 (test) | J (Jaccard Index) | 66.2 | 107 |
| Visual Object Tracking | OTB 2015 (test) | AUC Score | 59.2 | 47 |
| Video Object Segmentation | DAVIS 2017 | Jaccard Index (J) | 66.2 | 42 |
| Pose Propagation | JHMDB | PCK@0.1 | 58.6 | 20 |
| Video label propagation | JHMDB (val) | PCK@0.1 | 58.6 | 17 |
| Human Pose Tracking | JHMDB (val) | PCK@0.1 | 58.6 | 15 |
| Instance Segmentation Propagation | DAVIS 2017 | J Mean | 57.7 | 14 |
| Human Part Propagation | VIP (val) | mIoU | 34.1 | 12 |
| Segment Propagation | DAVIS | J&Fm Score | 59.5 | 7 |
Showing 10 of 14 rows
