Joint-task Self-supervised Learning for Temporal Correspondence
About
This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region and pixel levels. Region-level localization reduces ambiguities in fine-grained matching by narrowing down search regions, while fine-grained matching provides bottom-up features that facilitate region-level localization. Our method outperforms state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.
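The idea of an inter-frame affinity matrix that models transitions between frames can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `affinity` and `propagate`, the `temperature` parameter, and the choice of a softmax over reference locations are assumptions made for illustration only. Each frame is assumed to be represented by a matrix of per-location feature vectors, and labels (for example, segmentation masks) are carried from a reference frame to a target frame by weighting them with the affinity.

```python
import numpy as np


def affinity(feat_ref, feat_tgt, temperature=1.0):
    """Soft correspondence between two frames (illustrative assumption).

    feat_ref: (N, C) features at N reference-frame locations.
    feat_tgt: (M, C) features at M target-frame locations.
    Returns A of shape (N, M); each column is a distribution over
    reference locations for one target location.
    """
    logits = feat_ref @ feat_tgt.T / temperature
    # Subtract the per-column maximum for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


def propagate(labels_ref, A):
    """Carry per-location labels from the reference to the target frame.

    labels_ref: (N, K) label scores (e.g. one-hot masks) in the reference frame.
    A: (N, M) affinity from `affinity`.
    Returns (M, K) label scores in the target frame.
    """
    return A.T @ labels_ref


# Toy data: 6 locations, 16-dimensional unit-norm features, 3 label classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
labels = np.eye(3)[np.array([0, 0, 1, 1, 2, 2])]

# Matching a frame against itself with a low temperature should
# return (approximately) the original labels.
A_self = affinity(feats, feats, temperature=0.01)
labels_back = propagate(labels, A_self)
```

The same matrix could in principle serve both tasks described above: restricting its rows to a localized region narrows the search for pixel-level matches, while the pixel-level features determine where that region lies. How the two are coupled during training is specific to the paper and is not reproduced here.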
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 59.3 | 1130 |
| Video Object Segmentation | DAVIS 2017 (test) | J (Jaccard Index) | 66.2 | 107 |
| Visual Object Tracking | OTB 2015 (test) | AUC Score | 59.2 | 47 |
| Video Object Segmentation | DAVIS 2017 | Jaccard Index (J) | 66.2 | 42 |
| Pose Propagation | JHMDB | PCK@0.1 | 58.6 | 20 |
| Video label propagation | JHMDB (val) | PCK@0.1 | 58.6 | 17 |
| Human Pose Tracking | JHMDB (val) | PCK@0.1 | 58.6 | 15 |
| Instance Segmentation Propagation | DAVIS 2017 | J Mean | 57.7 | 14 |
| Human Part Propagation | VIP (val) | mIoU | 34.1 | 12 |
| Segment Propagation | DAVIS | J&Fm Score | 59.5 | 7 |