MobileCLIP2: Improving Multi-Modal Reinforced Training

About

Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models with 3-15 ms latency and 50-150M parameters that achieves state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency, lightweight architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through 1) better CLIP teacher ensembles trained on the DFN dataset and 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations, such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe a 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with the MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2× smaller and improves on DFN ViT-L/14 at 2.5× lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
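To illustrate why temperature matters in contrastive knowledge distillation, the following is a minimal, generic sketch and not the authors' implementation. The abstract does not specify the exact loss, so the KL-divergence form, the symmetric averaging over the image-to-text and text-to-image directions, and every function and parameter name here (`distillation_loss`, `teacher_temp`, `student_temp`, and so on) are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_rows(queries, keys, temperature):
    """Cosine similarity of each query against all keys, divided by temperature."""
    queries = [normalize(q) for q in queries]
    keys = [normalize(k) for k in keys]
    return [
        [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        for q in queries
    ]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(t_img, t_txt, s_img, s_txt, teacher_temp, student_temp):
    """Hypothetical symmetric distillation loss (an assumption, not the paper's).

    Averages KL(teacher || student) over the rows of the batch similarity
    matrix in both the image-to-text and text-to-image directions.
    """
    total = 0.0
    directions = ((t_img, t_txt, s_img, s_txt), (t_txt, t_img, s_txt, s_img))
    for t_query, t_key, s_query, s_key in directions:
        teacher_rows = similarity_rows(t_query, t_key, teacher_temp)
        student_rows = similarity_rows(s_query, s_key, student_temp)
        row_losses = [
            kl_divergence(softmax(t), softmax(s))
            for t, s in zip(teacher_rows, student_rows)
        ]
        total += sum(row_losses) / len(row_losses)
    return total / 2


# Toy batch of two image-text pairs with 2-dimensional embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.1], [0.1, 1.0]]

# Identical embeddings and temperatures give zero loss.
matched = distillation_loss(images, texts, images, texts, 1.0, 1.0)
# A sharper teacher temperature changes the target distribution, so the
# loss becomes positive even though the embeddings are identical.
mismatched = distillation_loss(images, texts, images, texts, 0.5, 1.0)
```

In this sketch a lower teacher temperature sharpens the target distribution the student is asked to match, which is one plausible reason the choice of temperature can affect distillation quality.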

Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari • 2025

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 37.9	3089
Image Classification	ImageNet 1k (test)	Top-1 Accuracy 77.8	939
Text-to-Image Retrieval	DCI	R@1 53.04	117
Image-to-Text Retrieval	DCI	R@1 55.06	111
Image Classification	ImageNet Robustness Suite	Top-1 Accuracy (ImageNet-A) 69	89
Text-to-Image Retrieval	DOCCI	Recall@1 79.65	66
Image-to-Text Retrieval	DOCCI	R@1 78	66
Monocular Depth Estimation	NYU v2 (val)	RMSE 53.3	18
Text-to-Image Retrieval	MSCOCO	Recall@1 49.63	8
Image-to-Text Retrieval	Flickr30K	Recall@1 90.4	4

Showing 10 of 13 rows
