Semantic-Aware Fine-Grained Correspondence
About
Establishing visual correspondence across images is a challenging and essential task. Recently, an influx of self-supervised methods has been proposed to better learn representations for visual correspondence. However, we find that these methods often fail to leverage semantic information and over-rely on the matching of low-level features. In contrast, human vision is capable of distinguishing between distinct objects as a pretext to tracking. Inspired by this paradigm, we propose to learn semantic-aware fine-grained correspondence. First, we demonstrate that semantic correspondence is implicitly available through a rich set of image-level self-supervised methods. We further design a pixel-level self-supervised learning objective which specifically targets fine-grained correspondence. For downstream tasks, we fuse these two kinds of complementary correspondence representations, demonstrating that they boost performance synergistically. Our method surpasses previous state-of-the-art self-supervised methods using convolutional networks on a variety of visual correspondence tasks, including video object segmentation, human pose tracking, and human part tracking.
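To make the fusion step concrete, here is a minimal sketch of how two per-location feature sets might be combined and then used to propagate labels from a reference frame to a target frame. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the fusion operator, so concatenating L2-normalized features is assumed here, and the names `fuse`, `propagate_labels`, `weight`, and `temperature` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(semantic, fine_grained, weight=1.0):
    """Assumed fusion: concatenate the two L2-normalized features per location.

    `weight` (hypothetical) scales the fine-grained part relative to the
    semantic part.
    """
    return [
        l2_normalize(s) + [weight * x for x in l2_normalize(f)]
        for s, f in zip(semantic, fine_grained)
    ]


def propagate_labels(ref_feats, ref_labels, tgt_feats, temperature=0.1):
    """Give each target location the softmax-weighted majority reference label."""
    num_classes = max(ref_labels) + 1
    result = []
    for t in tgt_feats:
        # Dot-product similarity to every reference location, temperature-scaled.
        sims = [sum(a * b for a, b in zip(t, r)) / temperature for r in ref_feats]
        top = max(sims)
        weights = [math.exp(s - top) for s in sims]  # numerically stable softmax
        votes = [0.0] * num_classes
        for w, label in zip(weights, ref_labels):
            votes[label] += w
        result.append(votes.index(max(votes)))
    return result


# Toy data: two reference locations with labels 0 and 1, two target locations.
ref = fuse([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
tgt = fuse([[0.9, 0.1], [0.2, 0.8]], [[1.0, 0.1], [0.0, 1.0]])
labels = propagate_labels(ref, [0, 1], tgt)
```

In this toy case each target location is closest to a different reference location, so `labels` comes out as `[0, 1]`. A real pipeline would operate on dense feature maps from the two encoders and would typically restrict matches to a local spatial window.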
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 68.3 | 1130 |
| Video label propagation | JHMDB (val) | PCK@0.1 | 61.9 | 17 |
| Human Pose Tracking | JHMDB (val) | PCK@0.1 | 59.3 | 15 |
| Human Part Propagation | VIP (val) | mIoU | 34 | 12 |
| Video label propagation | PerMIS Video | J&F Score | 73.2 | 7 |
| Video label propagation | DAVIS 2017 (val) | J&F Score | 71.2 | 7 |