
DINOv3

About

Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 55.9 | 3069 |
| Visual Question Answering | GQA | Accuracy 65.9 | 1425 |
| Image Classification | ImageNet-1K | Top-1 Acc 85.77 | 1239 |
| Semantic segmentation | ADE20K | mIoU 55.9 | 1028 |
| Image Classification | CIFAR-10 | -- | 875 |
| Image Classification | ImageNet V2 | Top-1 Acc 79.5 | 749 |
| Semantic segmentation | Cityscapes | mIoU 81.1 | 668 |
| Image Classification | Stanford Cars | Accuracy 84.2 | 660 |
| Image Classification | ImageNet-1K | Top-1 Acc 84.9 | 600 |
| Image Classification | Food-101 | Accuracy 88.1 | 570 |
Showing 10 of 469 rows
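
For readers unfamiliar with the mIoU metric reported in the segmentation rows, the following is a minimal sketch of how mean intersection-over-union is commonly computed from per-pixel class labels. It is illustrative only: the function name `mean_iou`, its parameters, and the toy labels are assumptions made for this example, and the exact evaluation protocol behind the listed numbers (ignored labels, dataset-level accumulation, multi-scale testing) is not stated on this page.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection-over-union over flat sequences of class labels.

    Hypothetical helper for illustration; not taken from the DINOv3 code.
    """
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip pixels marked as "ignore" in the ground truth
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # classes absent from both prediction and target are skipped
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: six "pixels", three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.1f}")
```

Benchmarks usually accumulate the per-class intersection and union counts over the whole validation set before averaging, rather than averaging per-image scores, and report the result as a percentage.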