
AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One

About

A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs such as CLIP, DINOv2, and SAM are trained with distinct objectives and exhibit unique characteristics for various downstream tasks. We find that despite their conceptual differences, these models can be effectively merged into a unified model through multi-teacher distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features, such as zero-shot vision-language comprehension, detailed pixel-level understanding, and open-vocabulary segmentation capabilities. In pursuit of the most hardware-efficient backbone, we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models. Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20K semantic segmentation, COCO object detection, and the LLaVA-1.5 framework. Code: https://github.com/NVlabs/RADIO
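
The multi-teacher distillation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-distance matching loss, the per-teacher linear adaptor heads, and the names `cosine_distance`, `apply_head`, and `multi_teacher_loss` are all assumptions made for illustration, and plain Python lists stand in for real feature tensors.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def apply_head(weights, features):
    """Hypothetical linear adaptor head: maps student features into one
    teacher's feature space (weights is a list of rows)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def multi_teacher_loss(student_features, heads, teacher_features, teacher_weights):
    """Weighted sum of per-teacher matching losses (assumed form, not the
    paper's exact objective)."""
    total = 0.0
    for name, target in teacher_features.items():
        predicted = apply_head(heads[name], student_features)
        total += teacher_weights[name] * cosine_distance(predicted, target)
    return total

# Toy example: one student vector, two teachers with 2-D features.
student = [1.0, 0.0]
heads = {
    "clip": [[1.0, 0.0], [0.0, 1.0]],    # identity head
    "dinov2": [[0.0, 1.0], [1.0, 0.0]],  # head that swaps the two axes
}
teachers = {"clip": [2.0, 0.0], "dinov2": [0.0, 3.0]}
weights = {"clip": 1.0, "dinov2": 1.0}

loss = multi_teacher_loss(student, heads, teachers, weights)
```

In this toy setup each head maps the student vector onto the direction of its teacher's target, so every per-teacher term is zero. In an actual training loop the student backbone and the heads would be optimized jointly to drive such a loss down across all teachers.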

Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov • 2023

Related benchmarks

Task                             Dataset        Metric      Result   Rank
Semantic segmentation            ADE20K (val)   mIoU        52.84    2888
Object Hallucination Evaluation  POPE           Accuracy    86.2     1455
Visual Question Answering        VQA v2         Accuracy    76.3     1362
Visual Question Answering        TextVQA        Accuracy    56.3     1285
Visual Question Answering        GQA            Accuracy    63       1249
Image Classification             ImageNet-1K    Top-1 Acc   83.16    1239
Semantic segmentation            ADE20K         mIoU        51.36    1024
Semantic segmentation            Cityscapes     mIoU        78.4     658
Visual Question Answering        ChartQA        Accuracy    15.7     371
Semantic segmentation            ADE20K         mIoU        53       366
Showing 10 of 59 rows
