Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
About
Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture-of-experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise, in particular around training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvements over dense models of equivalent computational cost. LIMoE-L/16, trained comparably to CLIP-L/14, achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2% for CLIP), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods that use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
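The entropy-based regularization mentioned above can be illustrated with a short sketch: per-token routing entropy is pushed down (so each token routes confidently to a few experts) while the entropy of the batch-averaged routing distribution is pushed up (so experts are used in a balanced way). This is not the authors' implementation; the function name, shapes, and the way the two terms are combined are assumptions for illustration, and the per-modality thresholds and loss weights from the paper are omitted.

```python
# Minimal sketch (in JAX) of entropy-based auxiliary losses for balancing
# MoE router utilization, in the spirit of LIMoE's local/global entropy
# regularization. Names and shapes here are illustrative assumptions.
import jax
import jax.numpy as jnp


def router_entropy_loss(router_logits):
    """router_logits: [num_tokens, num_experts] pre-softmax routing scores."""
    probs = jax.nn.softmax(router_logits, axis=-1)  # per-token routing distribution

    # Local entropy: mean per-token entropy. Minimizing it encourages each
    # token to route sharply to a small set of experts.
    local_entropy = -jnp.mean(jnp.sum(probs * jnp.log(probs + 1e-9), axis=-1))

    # Global entropy: entropy of the batch-averaged routing distribution.
    # Maximizing it encourages balanced utilization across all experts.
    mean_probs = jnp.mean(probs, axis=0)
    global_entropy = -jnp.sum(mean_probs * jnp.log(mean_probs + 1e-9))

    # Auxiliary loss: low per-token entropy, high aggregate entropy.
    return local_entropy - global_entropy


# Example: 8 tokens routed over 4 experts.
key = jax.random.PRNGKey(0)
logits = jax.random.normal(key, (8, 4))
print(router_entropy_loss(logits))
```

In the paper this style of regularization is applied per modality, so that both image and text tokens retain access to a balanced set of experts; the sketch above applies it to a single token batch for brevity.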
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Accuracy | 84.1 | 836 |
| Alzheimer stage classification | ADNI | AUC | 72.25 | 116 |
| Mortality Prediction | MIMIC-IV (test) | AUC | 65.18 | 43 |
| Mortality Prediction | MIMIC-IV | Accuracy | 64.89 | 24 |
| Zero-shot Classification | ImageNet-1K, V2, R, A (test) | Top-1 Accuracy (Test) | 84.1 | 6 |