
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models

About

Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce SigLino, an efficient family of agglomerative vision foundation models that distill knowledge from SigLIP2 and DINOv3 simultaneously into Dense and Mixture-of-Experts students. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions without sacrificing performance, (3) hierarchical clustering and sampling of training data, typically reserved for self-supervised learning, substantially improves sample efficiency over random sampling for multi-teacher distillation, and (4) the resulting representations transfer effectively to early-fusion Grounding-VLMs, outperforming models trained from scratch. By combining these findings, we curate OpenLVD200M, a 200M-image corpus that demonstrates superior efficiency for multi-teacher distillation. Instantiated in a Mixture-of-Experts, our SigLino-MoE initializes an early-fusion Grounding-VLM that replaces the conventional ViT→LLM stack, demonstrating improved performance compared to a model trained from scratch. We release OpenLVD200M and five distilled checkpoints comprising MoE and dense variants.
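The token-balanced batching idea from point (2) can be sketched as a greedy packing problem: variable-resolution images are grouped into sequences whose combined patch-token counts stay under a shared budget, so every batch element carries a comparable compute load. The sketch below is an illustration under assumed details, not the paper's implementation: the names `patch_size` and `token_budget`, and the first-fit-decreasing strategy, are all assumptions.

```python
# Hypothetical sketch of token-balanced batching: pack variable-resolution
# images into sequences with a uniform token budget. The greedy first-fit-
# decreasing strategy is an assumption, not the paper's actual algorithm.

def num_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Patch tokens a ViT-style encoder produces for one image."""
    return (height // patch_size) * (width // patch_size)

def pack_token_balanced(images, token_budget: int = 4096, patch_size: int = 16):
    """Group images (given as (height, width) pairs) into sequences whose
    combined token counts stay within a shared budget, so each packed
    sequence has a comparable compute footprint across resolutions."""
    sequences = []  # each entry: list of image indices in one packed sequence
    budgets = []    # remaining token budget per sequence
    # Sort largest-first so big images seed sequences before small fillers.
    order = sorted(range(len(images)),
                   key=lambda i: -num_tokens(*images[i], patch_size))
    for i in order:
        need = num_tokens(*images[i], patch_size)
        for s, left in enumerate(budgets):
            if need <= left:          # first sequence with room gets the image
                sequences[s].append(i)
                budgets[s] -= need
                break
        else:                         # no sequence fits: open a new one
            sequences.append([i])
            budgets.append(token_budget - need)
    return sequences
```

For example, with a 2048-token budget and 16-pixel patches, a 512x512 image (1024 tokens) and a 1024x256 image (1024 tokens) can share one sequence, while smaller images fill a second; no sequence ever exceeds the budget, which keeps per-step memory and compute roughly uniform.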

Sofian Chaybouti, Sanath Narayan, Yasser Dahou, Phúc H. Lê Khac, Ankit Singh, Ngoc Dung Huynh, Wamiq Reyaz Para, Hilde Kuehne, Hakim Hacid • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Semantic segmentation | ADE20K | mIoU | 51.37 | 1024
Semantic segmentation | Cityscapes | mIoU | 64.89 | 658
Text-to-Image Retrieval | Flickr30K | R@1 | 81.2 | 531
Image Classification | ImageNet | Top-1 Accuracy | 82.78 | 431
Image-to-Text Retrieval | Flickr30K | R@1 | 94.3 | 429
Text-to-Image Retrieval | MSCOCO (5K) | R@1 | 53.98 | 42
Image-to-Text Retrieval | MSCOCO (5K) | R@1 | 72.14 | 33
Semantic segmentation | PascalVOC | mIoU | 84.4 | 18
Image-Text Classification | ImageNet 1k (test val) | Top-1 Acc | 80.17 | 11
Image-Text Classification | Caltech-101 | Top-1 Acc | 88.76 | 5

(Showing 10 of 12 rows)
