
Emergent Correspondence from Image Diffusion

About

Finding correspondences between images is a fundamental problem in computer vision. In this paper, we show that correspondence emerges in image diffusion models without any explicit supervision. We propose a simple strategy to extract this implicit knowledge out of diffusion networks as image features, namely DIffusion FeaTures (DIFT), and use them to establish correspondences between real images. Without any additional fine-tuning or supervision on the task-specific data or annotations, DIFT is able to outperform both weakly-supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences. Particularly for semantic correspondence, DIFT from Stable Diffusion is able to outperform DINO and OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k benchmark. It even outperforms the state-of-the-art supervised methods on 9 out of 18 categories while remaining on par for the overall performance. Project page: https://diffusionfeatures.github.io
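At its core, DIFT establishes a correspondence by taking the feature at a query point in one image and finding the location with the most similar feature in the other image. The following is a minimal sketch of that matching step, assuming dense feature maps have already been extracted (in the paper, from intermediate activations of a diffusion U-Net at a chosen timestep; the extraction itself requires Stable Diffusion and is omitted here). Function names and shapes are illustrative, not the authors' code.

```python
import numpy as np

def match_point(feat_src, feat_tgt, yx):
    """Nearest-neighbor correspondence between two (C, H, W) feature maps.

    feat_src, feat_tgt: dense per-pixel features (e.g. diffusion U-Net
    activations, here just arrays for illustration).
    yx: (row, col) of the query point in the source map.
    Returns the (row, col) in the target map whose feature has the
    highest cosine similarity to the query feature.
    """
    c, h, w = feat_tgt.shape
    q = feat_src[:, yx[0], yx[1]]
    q = q / (np.linalg.norm(q) + 1e-8)          # normalize query feature
    t = feat_tgt.reshape(c, -1)
    t = t / (np.linalg.norm(t, axis=0, keepdims=True) + 1e-8)
    sim = q @ t                                  # cosine similarity everywhere
    return divmod(int(np.argmax(sim)), w)        # flat index -> (row, col)
```

For example, if the target feature map is a column-shifted copy of the source, the matched location shifts accordingly; on real images, this same argmax-over-cosine-similarity is what produces the semantic, geometric, and temporal correspondences evaluated below.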

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, Bharath Hariharan • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 72.7 | 1193 |
| Semantic Correspondence | SPair-71k (test) | -- | -- | 125 |
| Semantic Correspondence | PF-WILLOW | PCK@0.1 (bbox) | 85.1 | 109 |
| Semantic Correspondence | PF-Pascal (test) | PCK@0.1 | 84.6 | 106 |
| Semantic Correspondence | PF-PASCAL | PCK@alpha=0.1 | 82.2 | 98 |
| Homography Estimation | HPatches | Overall Accuracy (< 1px) | 45.6 | 81 |
| Point Tracking | TAP-Vid DAVIS (First) | Delta Avg (<c) | 38.2 | 76 |
| Point Tracking | TAP-Vid Kinetics (First) | Avg Displacement Error (delta_avg) | 25.56 | 53 |
| Point Tracking | DAVIS TAP-Vid | Average Jaccard (AJ) | 21.51 | 52 |
| Point Tracking | TAP-Vid Kinetics | Overall Accuracy | 63.17 | 48 |

Showing 10 of 47 rows.
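For reference, the PCK metric used in several rows above can be sketched as follows. This is the standard definition (bbox variant, the usual SPair-71k / PF convention): a predicted keypoint counts as correct when its distance to the ground truth is at most alpha times the larger bounding-box side. The function name and argument layout are illustrative, not from a specific benchmark toolkit.

```python
import numpy as np

def pck(pred, gt, bbox_hw, alpha=0.1):
    """Percentage of Correct Keypoints (PCK@alpha, bbox variant).

    pred, gt: (N, 2) arrays of predicted / ground-truth keypoints.
    bbox_hw: (height, width) of the object bounding box; the distance
    threshold is alpha * max(height, width).
    Returns the fraction of keypoints within the threshold.
    """
    thresh = alpha * max(bbox_hw)
    dist = np.linalg.norm(pred - gt, axis=1)   # per-keypoint error
    return float((dist <= thresh).mean())
```

So a PCK@0.1 of 85.1 on PF-WILLOW means that 85.1% of predicted keypoints fall within 10% of the larger bounding-box dimension of their ground-truth locations.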

Other info

Code
