
AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model

About

Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching, which packs varying-resolution images into sequences with uniform token budgets, stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data (typically reserved for self-supervised learning) substantially improves sample efficiency over random sampling for multi-teacher distillation. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. We release OpenLVD200M and the distilled models.
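The token-balanced batching idea in point (2) can be sketched as a greedy packing problem: each image contributes a number of patch tokens proportional to its resolution, and images are grouped so every batch stays within a fixed token budget. The sketch below uses a first-fit-decreasing heuristic; the function names, patch size, and budget are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of token-balanced batching: pack variable-resolution
# images into batches whose total patch-token count stays within a budget.
# All names and defaults here are assumptions for illustration.

def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of ViT patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)

def pack_token_balanced(image_sizes, token_budget: int = 1024, patch: int = 16):
    """Greedily assign images (given as (H, W) pairs) to batches so each
    batch's total token count is <= token_budget (first-fit decreasing)."""
    # Sort largest-first so big images seed batches early.
    indexed = sorted(enumerate(image_sizes),
                     key=lambda iw: num_tokens(*iw[1], patch),
                     reverse=True)
    batches, loads = [], []
    for idx, (h, w) in indexed:
        t = num_tokens(h, w, patch)
        placed = False
        for b in range(len(batches)):
            if loads[b] + t <= token_budget:
                batches[b].append(idx)   # image fits in an existing batch
                loads[b] += t
                placed = True
                break
        if not placed:                   # open a new batch for this image
            batches.append([idx])
            loads.append(t)
    return batches, loads

sizes = [(224, 224), (448, 448), (224, 448), (336, 336)]
batches, loads = pack_token_balanced(sizes, token_budget=1024)
```

With a 16-pixel patch these four images contribute 196, 784, 392, and 441 tokens, so no single batch of equal image count would have a uniform token load; packing by token budget instead yields batches of near-equal compute cost regardless of resolution mix.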

Sofian Chaybouti, Sanath Narayan, Yasser Dahou, Phúc H. Lê Khac, Ankit Singh, Ngoc Dung Huynh, Wamiq Reyaz Para, Hilde Kuehne, Hakim Hacid • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | mIoU | 51.37 | 936 |
| Semantic Segmentation | Cityscapes | mIoU | 64.89 | 578 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 81.2 | 460 |
| Image Classification | ImageNet | Top-1 Accuracy | 82.78 | 429 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 94.3 | 379 |
| Text-to-Image Retrieval | MSCOCO (5K) | R@1 | 53.98 | 42 |
| Image-to-Text Retrieval | MSCOCO (5K) | R@1 | 72.14 | 33 |
| Semantic Segmentation | PascalVOC | mIoU | 84.4 | 18 |
| Image-Text Classification | ImageNet 1k (test val) | Top-1 Acc | 80.17 | 11 |
| Image-Text Classification | Caltech-101 | Top-1 Acc | 88.76 | 5 |

Showing 10 of 12 rows.
