Learning Correspondence from the Cycle-Consistency of Time
About
We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. At training time, our model learns a feature map representation that is useful for cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation, without finetuning, across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods.
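The two ideas in the abstract, checking that a track returns to its starting point and propagating labels through nearest neighbors in feature space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`propagate_labels`, `cycle_returns_to_start`), the flattened `(locations, channels)` feature layout, the hard arg-max tracker, and the `k` and `temperature` parameters are all assumptions made for illustration.

```python
import numpy as np


def _l2_normalize(x):
    # Row-wise L2 normalization so that dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)


def propagate_labels(feat_src, feat_tgt, labels_src, k=5, temperature=0.07):
    """Propagate labels from a source frame to a target frame (hypothetical helper).

    feat_src:   (N, D) features of N source locations.
    feat_tgt:   (M, D) features of M target locations.
    labels_src: (N, C) one-hot or soft labels at the source locations.
    Returns an (M, C) array of soft labels for the target locations.
    """
    sim = _l2_normalize(feat_tgt) @ _l2_normalize(feat_src).T  # (M, N)
    k = min(k, sim.shape[1])
    # Indices of the k most similar source locations for each target location.
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (M, k)
    topk_sim = np.take_along_axis(sim, topk, axis=1)
    # Softmax over the k neighbors (max subtracted for numerical stability).
    w = np.exp((topk_sim - topk_sim.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted average of the neighbors' labels.
    return np.einsum("mk,mkc->mc", w, labels_src[topk])


def cycle_returns_to_start(feats):
    """Track every location of frame 0 forward to the last frame and back.

    feats: list of (N_t, D) feature arrays, one per frame.
    Returns a boolean (N_0,) array: True where the hard nearest-neighbor
    track ends at the location it started from (a simplified, assumed
    stand-in for a cycle-consistency check).
    """
    n_frames = len(feats)
    start = np.arange(feats[0].shape[0])
    idx = start.copy()
    # Frame order: 0, 1, ..., T-1, T-2, ..., 0.
    path = list(range(n_frames)) + list(range(n_frames - 2, -1, -1))
    for a, b in zip(path[:-1], path[1:]):
        sim = _l2_normalize(feats[a][idx]) @ _l2_normalize(feats[b]).T
        idx = sim.argmax(axis=1)
    return idx == start
```

In this sketch the cycle check is a hard pass/fail test; a training objective would need a differentiable measure of how far the returned track lands from its starting point.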
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J Mean | 46.4 | 1130 |
| Semantic segmentation | VOC 2012 (val) | mIoU | 52.8 | 67 |
| One-shot Video Object Segmentation | DAVIS 2016 (val) | J Mean | 55.8 | 28 |
| Pose Propagation | JHMDB | PCK@0.1 | 57.7 | 20 |
| Video label propagation | JHMDB (val) | PCK@0.1 | 57.3 | 17 |
| Human Pose Tracking | JHMDB (val) | PCK@0.1 | 57.3 | 15 |
| Instance Segmentation Propagation | DAVIS 2017 | J Mean | 46.4 | 14 |
| Human Part Propagation | VIP (val) | mIoU | 28.9 | 12 |
| Human Pose Tracking | JHMDB (split1) | PCK@0.1 | 57.3 | 11 |
| One-shot Video Object Segmentation | DAVIS 2017 (val) | J&F Mean | 42.8 | 11 |
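
For readers unfamiliar with the metrics in the table, the sketch below shows how a region-similarity score (J, the Jaccard index or intersection over union of two masks) and a PCK score at a 0.1 threshold might be computed for a single prediction. The function names, the convention for two empty masks, and the `ref_length` argument are assumptions; each benchmark defines its own reference length and its own averaging over frames, objects, or classes.

```python
import numpy as np


def jaccard(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def pck(pred_kpts, gt_kpts, ref_length, alpha=0.1):
    """Fraction of keypoints within alpha * ref_length of the ground truth.

    pred_kpts, gt_kpts: (K, 2) arrays of (x, y) coordinates.
    ref_length: normalizing length in pixels (an assumption here; the
                benchmark protocol decides what it is measured from).
    """
    pred = np.asarray(pred_kpts, dtype=float)
    gt = np.asarray(gt_kpts, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist <= alpha * ref_length).mean())
```

The table reports these scores as percentages; the functions above return fractions in the range 0 to 1.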