Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
About
Large multimodal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture-of-Experts (MoE) architectures are useful for instruction tuning, but for LMMs with parameter counts around 50-100B, the prohibitive cost of replicating and storing the expert models severely limits the number of experts that can be used. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach improves generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
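The idea of softly mixing low-rank experts on top of a frozen backbone can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name `smola_layer`, the per-token router, and the weight shapes are assumptions made for the example, which simply adds a softmax-weighted sum of low-rank (LoRA-style) corrections to a frozen linear layer's output.

```python
import numpy as np

def smola_layer(x, W, experts, router):
    """Soft mixture of low-rank experts (illustrative sketch only).

    x:       (d,) input token representation
    W:       (d, d) frozen backbone weight matrix
    experts: list of (A, B) pairs; A is (r, d), B is (d, r), with rank r << d
    router:  (num_experts, d) routing matrix producing per-expert logits
    """
    base = W @ x                          # frozen backbone path
    logits = router @ x
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()              # softmax: soft mixing, no hard routing
    # Residual correction: softly weighted sum of low-rank expert outputs.
    # With B = 0 for every expert, the layer reduces to the backbone alone.
    delta = sum(w * (B @ (A @ x)) for w, (A, B) in zip(weights, experts))
    return base + delta
```

Because every expert output is weighted and summed (rather than dispatched to a single expert), the layer stays differentiable end-to-end, and the added parameter count per expert is only 2·r·d instead of d².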
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.498 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 85.7 | 664 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 67.8 | 208 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 90.8 | 192 |
| Chart Question Answering | ChartQA (test) | -- | -- | 129 |
| Visual Question Answering | TextVQA (test) | Accuracy | 81.1 | 124 |
| Visual Question Answering | ScienceQA (test) | Accuracy | 94.7 | 95 |
| Information Visual Question Answering | InfoVQA (test) | ANLS | 80.3 | 92 |
| Visual Question Answering | OCR-VQA (test) | Accuracy | 75.7 | 77 |
| Visual Question Answering | VQAv2 (test-dev) | Accuracy | 85 | 76 |