AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model
About
Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce Agglomerative Mixture-of-Experts Vision Foundation Models (AMoE), which distill knowledge from SigLIP2 and DINOv3 simultaneously into a Mixture-of-Experts student. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, and (3) hierarchical clustering and sampling of training data--typically reserved for self-supervised learning--substantially improves sample efficiency over random sampling for multi-teacher distillation. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. We release OpenLVD200M and the distilled models.
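The exact form of the Asymmetric Relation-Knowledge Distillation loss is not given in this summary. As a rough, hedged illustration of the general relation-KD idea it builds on, the sketch below matches the pairwise cosine-similarity structure of student features against each teacher, so the student can have a different feature dimension than either teacher. The function names, the MSE objective, and the unweighted sum over teachers are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def relation_kd_loss(student_feats, teacher_feats):
    """Generic relation-KD sketch: match pairwise cosine-similarity structure.

    student_feats: (N, Ds), teacher_feats: (N, Dt). Feature dimensions may
    differ; only the N x N relation (Gram) matrices are compared.
    """
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    rel_s = s @ s.T  # (N, N) student pairwise similarities
    rel_t = t @ t.T  # (N, N) teacher pairwise similarities
    return float(np.mean((rel_s - rel_t) ** 2))

def multi_teacher_relation_loss(student_feats, teacher_feats_list):
    """Sum per-teacher relation losses (e.g. SigLIP2 and DINOv3 features)."""
    return sum(relation_kd_loss(student_feats, t) for t in teacher_feats_list)
```

Because only relation matrices are compared, each teacher's pairwise geometry can be preserved separately without forcing the student into a single teacher's embedding space.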
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU | 51.37 | 936 |
| Semantic segmentation | Cityscapes | mIoU | 64.89 | 578 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 81.2 | 460 |
| Image Classification | ImageNet | Top-1 Accuracy | 82.78 | 429 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 94.3 | 379 |
| Text-to-Image Retrieval | MSCOCO (5K) | R@1 | 53.98 | 42 |
| Image-to-Text Retrieval | MSCOCO (5K) | R@1 | 72.14 | 33 |
| Semantic segmentation | PascalVOC | mIoU | 84.4 | 18 |
| Image-Text Classification | ImageNet 1k (test val) | Top-1 Acc | 80.17 | 11 |
| Image-Text Classification | Caltech-101 | Top-1 Acc | 88.76 | 5 |