DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
About
Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method, named dino.txt, unlocks this new ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. With these, we successfully train a CLIP-like model with only a fraction of the computational cost compared to CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
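The key visual representation described above, concatenating the [CLS] token with the average of the patch tokens, can be sketched as follows. This is a minimal illustration with NumPy; the function name and tensor shapes are assumptions for exposition, not the released dino.txt API.

```python
import numpy as np

def visual_embedding(cls_token, patch_tokens):
    """Concatenate the [CLS] token with the mean of the patch tokens.

    cls_token:    (d,)   global token from the frozen DINOv2 encoder
    patch_tokens: (n, d) per-patch tokens from the same encoder
    returns:      (2d,)  representation used to train the text alignment
    """
    patch_avg = patch_tokens.mean(axis=0)
    return np.concatenate([cls_token, patch_avg])

# Toy example: 4 patch tokens of dimension 3.
cls = np.ones(3)
patches = np.arange(12, dtype=float).reshape(4, 3)
emb = visual_embedding(cls, patches)
# First half carries the global [CLS] signal (useful for classification),
# second half the averaged patch signal (useful for dense tasks).
```

Concatenating rather than summing keeps the global and dense signals separable, which is consistent with the paper's goal of improving both image-level and pixel-level alignment.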
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 81.6 | 1239 |
| Semantic Segmentation | ADE20K | mIoU | 25.1 | 1024 |
| Semantic Segmentation | Cityscapes | mIoU | 41 | 658 |
| Image Classification | ImageNet A | Top-1 Acc | 83.2 | 654 |
| Image Classification | ImageNet V2 | Top-1 Acc | 75.9 | 611 |
| Semantic Segmentation | COCO Stuff | mIoU | 24.1 | 379 |
| Semantic Segmentation | ADE20K | mIoU | 52.8 | 366 |
| Image Classification | ObjectNet | Top-1 Acc | 74.5 | 219 |
| 3D Semantic Segmentation | ScanNet V2 (val) | mIoU | 59.4 | 209 |
| Semantic Segmentation | Pascal Context 59 | mIoU | 36.7 | 204 |
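The zero-shot classification results above follow the standard CLIP-style protocol: an image is assigned to the class whose text-prompt embedding has the highest cosine similarity with the image embedding. A generic sketch of that inference step (the embeddings here are toy values, not model outputs):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the class whose (L2-normalized) text embedding
    is most cosine-similar to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

# Toy embeddings for two class prompts, e.g. "a photo of a cat" / "... dog".
image = np.array([0.9, 0.1, 0.0])
class_texts = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0]])
pred = zero_shot_classify(image, class_texts)  # -> 0 (first class wins)
```

For open-vocabulary semantic segmentation the same similarity is computed per patch token instead of per image, which is why the patch-average term in the training objective matters for the dense benchmarks.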