DINOv2: Learning Robust Visual Features without Supervision

About

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
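To make "features that work ... without finetuning" concrete, here is a minimal sketch of using frozen embeddings with a very simple classifier on top. Everything in it is an assumption for illustration: the two-dimensional vectors are toy stand-ins for embeddings from a pretrained backbone, the function names are hypothetical, and the nearest-centroid classifier is an illustrative choice, not the evaluation protocol used in the paper.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fit_centroids(features, labels):
    """Average the normalized features of each class; the backbone is never updated."""
    sums = {}
    counts = defaultdict(int)
    for feat, label in zip(features, labels):
        feat = l2_normalize(feat)
        if label not in sums:
            sums[label] = [0.0] * len(feat)
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    return {
        label: l2_normalize([s / counts[label] for s in total])
        for label, total in sums.items()
    }


def predict(centroids, feature):
    """Return the class whose centroid has the highest cosine similarity."""
    feature = l2_normalize(feature)
    return max(
        centroids,
        key=lambda label: sum(a * b for a, b in zip(centroids[label], feature)),
    )


# Toy stand-ins for frozen image embeddings (hypothetical values).
train_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_labels = ["cat", "cat", "dog", "dog"]

centroids = fit_centroids(train_features, train_labels)
prediction = predict(centroids, [0.8, 0.3])
```

The point of the sketch is only that the feature extractor stays fixed: adapting to a new task means fitting a lightweight head on top of the embeddings rather than updating the backbone.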

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 47.5	3089
Object Detection	COCO 2017 (val)	AP 55.5	2930
Object Hallucination Evaluation	POPE	Accuracy 86.24	2056
Visual Question Answering	VizWiz	Accuracy 49.15	1863
Visual Question Answering	TextVQA	Accuracy 15.1	1455
Visual Question Answering	GQA	Accuracy 72.7	1445
Visual Question Answering	VQA v2	Accuracy 76.7	1429
Instance Segmentation	COCO 2017 (val)	--	1304
Video Object Segmentation	DAVIS 2017 (val)	J mean 64.8	1251
Image Classification	ImageNet-1K	Top-1 Acc 86.2	1239

Showing 10 of 1103 rows
