
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

About

Contrastive Language-Image Pre-training (CLIP) has become a cornerstone of multimodal intelligence. However, recent studies have found that CLIP can only encode one aspect of the feature space, leading to substantial information loss and indistinctive features. To mitigate this issue, this paper introduces a novel strategy that fine-tunes a series of complementary CLIP models and transforms them into a CLIP-MoE. Specifically, we propose a model-agnostic Diversified Multiplet Upcycling (DMU) framework for CLIP. Instead of training multiple CLIP models from scratch, DMU leverages a pre-trained CLIP and fine-tunes it into a diverse set of models via highly cost-effective multistage contrastive learning, thus capturing distinct feature subspaces efficiently. To fully exploit these fine-tuned models while minimizing computational overhead, we transform them into a CLIP-MoE, which dynamically activates a subset of CLIP experts, achieving an effective balance between model capacity and computational cost. Comprehensive experiments demonstrate the superior performance of CLIP-MoE across various zero-shot retrieval and zero-shot image classification tasks, as well as on downstream Multimodal Large Language Model (MLLM) benchmarks when CLIP-MoE is used as a vision encoder.

Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng • 2024
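To make the upcycling idea concrete, the snippet below is a minimal, hypothetical PyTorch sketch: the FFN block of each DMU-fine-tuned CLIP copy becomes one expert in a sparse Mixture-of-Experts layer, and a learned router activates only the top-k experts per token. All names here (ExpertMLP, UpcycledMoELayer, the expert count, top_k) are illustrative assumptions, not the authors' implementation; the paper's exact routing scheme, which layers are upcycled, and the training details may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """A standard transformer FFN block, as found in a CLIP encoder layer."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


class UpcycledMoELayer(nn.Module):
    """Sparse MoE layer assembled from the FFNs of several fine-tuned CLIP copies.

    Each expert would be initialized from the FFN weights of one fine-tuned
    model (upcycled rather than randomly initialized); a learned router then
    activates only the top-k experts per token.
    """
    def __init__(self, expert_ffns, dim: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)
        self.router = nn.Linear(dim, len(expert_ffns))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> flatten tokens for per-token routing
        b, s, d = x.shape
        tokens = x.reshape(-1, d)
        gate_logits = self.router(tokens)                         # (N, num_experts)
        topk_vals, topk_idx = gate_logits.topk(self.top_k, dim=-1)
        topk_weights = topk_vals.softmax(dim=-1)                  # (N, top_k)

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(b, s, d)


# Illustrative usage: four experts standing in for four DMU-fine-tuned copies.
# In practice each ExpertMLP would be loaded with fine-tuned FFN weights.
dim, hidden = 768, 3072
experts = [ExpertMLP(dim, hidden) for _ in range(4)]
moe = UpcycledMoELayer(experts, dim=dim, top_k=2)
tokens_out = moe(torch.randn(2, 50, dim))   # (batch, seq_len, dim)
```

Top-k routing keeps the per-token compute close to that of a single dense CLIP while the total parameter count grows with the number of experts, which is the capacity/cost balance the abstract describes.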

Related benchmarks

Task                       Dataset         Metric     Result   Rank
Image Classification       ImageNet V2     --         --       611
Image Classification       EuroSAT         Accuracy   62.2     569
Image Classification       Flowers102      Accuracy   72.1     558
Image Classification       DTD             Accuracy   54.9     542
Text-to-Image Retrieval    Flickr30K       R@1        42.1     531
Image Classification       Food101         Accuracy   88.7     457
Image Classification       SUN397          Accuracy   70.1     441
Image-to-Text Retrieval    Flickr30K       R@1        60.5     429
Image Classification       Aircraft        Accuracy   29       333
Image Classification       StanfordCars    Accuracy   74.9     312
Showing 10 of 23 rows
