Deep ViT Features as Dense Visual Descriptors
About
We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties: (i) the features encode powerful, well-localized semantic information at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related yet different object categories; and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation, and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training or data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve results competitive with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at dino-vit-features.github.io.
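As a rough illustration of what a lightweight, training-free method applied directly to dense descriptors can look like, the sketch below matches two sets of descriptors by mutual nearest neighbours under cosine similarity. It is not the authors' implementation: the function name `mutual_nearest_neighbors` is hypothetical, and random arrays stand in for per-patch ViT features, which would in practice come from a pretrained model.

```python
import numpy as np


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where desc_a[i] and desc_b[j] are each
    other's nearest neighbour under cosine similarity.

    desc_a: (N, D) array, desc_b: (M, D) array of dense descriptors.
    """
    # L2-normalise so the dot product equals cosine similarity.
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                      # (N, M) similarity matrix

    nn_ab = sim.argmax(axis=1)         # best match in B for each row of A
    nn_ba = sim.argmax(axis=0)         # best match in A for each row of B

    idx_a = np.arange(len(a))
    mutual = nn_ba[nn_ab] == idx_a     # keep only pairs that agree both ways
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)


# Stand-in data (assumption): image B holds a shuffled, slightly noisy
# copy of image A's descriptors, so the true matches are known.
rng = np.random.default_rng(0)
desc_a = rng.normal(size=(6, 16))
perm = np.array([3, 0, 5, 1, 4, 2])
desc_b = desc_a[perm] + 0.01 * rng.normal(size=(6, 16))

pairs = mutual_nearest_neighbors(desc_a, desc_b)
for i, j in pairs:
    print(f"A[{i}] <-> B[{j}]")
```

No parameters are learned here; the quality of the matches depends entirely on the descriptors, which is the property the abstract attributes to DINO-ViT features.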
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic Correspondence | SPair-71k (test) | PCK@0.1 | 33.3 | 122 |
| Semantic Correspondence | PF-PASCAL | PCK @ alpha=0.1 | 62.4 | 98 |
| Video Object Segmentation | DAVIS | J Mean | 52.1 | 58 |
| Unsupervised Object Discovery | COCO 20k | CorLoc | 57.99 | 56 |
| Unsupervised Object Discovery | PASCAL VOC 2012 | CorLoc | 71.64 | 28 |
| Unsupervised Object Discovery | PASCAL VOC 2007 | CorLoc | 68.27 | 28 |
| Video Object Segmentation | DAVIS (val) | Mean J & F Score | 50.9 | 28 |
| Novel View Synthesis | D-RE10K static regions only (test) | PSNR | 18.67 | 26 |
| Novel View Synthesis | D-RE10K-iPhone full-image fidelity (test) | PSNR | 17.96 | 26 |
| Semantic Matching | TSS | PCK (FG) | 64.7 | 24 |