Self-supervised Learning for Video Correspondence Flow

About

The objective of this paper is self-supervised learning of feature embeddings that are suitable for matching correspondences along videos, which we term correspondence flow. By leveraging the natural spatial-temporal coherence in videos, we propose to train a "pointer" that reconstructs a target frame by copying pixels from a reference frame. We make the following contributions: First, we introduce a simple information bottleneck that forces the model to learn robust features for correspondence matching and prevents it from learning trivial solutions, e.g. matching based on low-level colour information. Second, to tackle the challenges of tracker drift caused by complex object deformations, illumination changes and occlusions, we propose to train a recursive model over long temporal windows with scheduled sampling and cycle consistency. Third, we achieve state-of-the-art performance on DAVIS 2017 video segmentation and JHMDB keypoint tracking tasks, outperforming all previous self-supervised learning approaches by a significant margin. Fourth, in order to shed light on the potential of self-supervised learning on the task of video correspondence flow, we probe the upper bound by training on additional data, i.e. more diverse videos, further demonstrating significant improvements on video segmentation.
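
The "pointer" idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a dot-product similarity followed by a softmax over all reference locations, and the names `copy_from_reference`, `softmax` and `temperature` are hypothetical, chosen only for this example. Each target location is reconstructed as a similarity-weighted copy of reference values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_from_reference(tgt_feats, ref_feats, ref_values, temperature=1.0):
    """Reconstruct one value vector per target location as an
    attention-weighted copy of reference values (illustrative only).

    tgt_feats:  feature vectors, one per target location
    ref_feats:  feature vectors, one per reference location
    ref_values: value vectors (e.g. colours) aligned with ref_feats
    """
    dim = len(ref_values[0])
    out = []
    for q in tgt_feats:
        # Similarity of this target location to every reference location.
        scores = [sum(a * b for a, b in zip(q, k)) / temperature
                  for k in ref_feats]
        weights = softmax(scores)
        # Soft copy: weighted sum of the reference values.
        out.append([sum(w * v[d] for w, v in zip(weights, ref_values))
                    for d in range(dim)])
    return out


# Toy data: two reference locations with distinct features and colours.
ref_feats = [[1.0, 0.0], [0.0, 1.0]]
ref_values = [[255.0, 0.0, 0.0], [0.0, 0.0, 255.0]]
# The target holds the same two locations in swapped order.
tgt_feats = [[0.0, 1.0], [1.0, 0.0]]

recon = copy_from_reference(tgt_feats, ref_feats, ref_values, temperature=0.05)
```

With a low temperature the weights become nearly one-hot, so each target location copies the colour of its best-matching reference location. Passing segmentation labels instead of colours as `ref_values` would propagate a mask in the same way, which is how a learned correspondence can be reused for tasks such as video segmentation.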

Zihang Lai, Weidi Xie • 2019

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Object Segmentation | DAVIS 2017 (val) | J mean: 48.4 | 1130 |
| Video Object Segmentation | YouTube-VOS 2018 (val) | J Score (Seen): 50.6 | 493 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | J-Score (Seen): 51.2 | 231 |
| Video Object Segmentation | DAVIS 2017 (test) | J (Jaccard Index): 54.2 | 107 |
| Video Object Segmentation | DAVIS 2017 | Jaccard Index (J): 54.2 | 42 |
| One-shot Video Object Segmentation | DAVIS 2016 (val) | J Mean: 48.9 | 28 |
| Video label propagation | JHMDB (val) | PCK@0.1: 58.5 | 17 |
| Instance Segmentation Propagation | DAVIS 2017 | J Mean: 47.7 | 14 |
| One-shot Video Object Segmentation | DAVIS 2017 (val) | J&F Mean: 50.3 | 11 |
| Pose Keypoint Propagation | JHMDB split 1 (val) | -- | 10 |
