Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
About
Large multimodal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture-of-Experts (MoE) architectures are useful for instruction tuning, but for LMMs with parameter counts around 50-100B, the prohibitive cost of replicating and storing the expert models severely limits the number of experts that can be used. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach improves generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
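The idea of softly mixing low-rank experts on top of a frozen backbone can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name `smola_layer`, the per-token router, and the weight shapes are assumptions made for the example, which simply adds a softmax-weighted sum of low-rank (LoRA-style) corrections to a frozen linear layer's output.

```python
import numpy as np

def smola_layer(x, W, experts, router):
    """Soft mixture of low-rank experts (illustrative sketch only).

    x:       (d,) input token representation
    W:       (d, d) frozen backbone weight matrix
    experts: list of (A, B) pairs; A is (r, d), B is (d, r), with rank r << d
    router:  (num_experts, d) routing matrix producing per-expert logits
    """
    base = W @ x                          # frozen backbone path
    logits = router @ x
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()              # softmax: soft mixing, no hard routing
    # Residual correction: softly weighted sum of low-rank expert outputs.
    # With B = 0 for every expert, the layer reduces to the backbone alone.
    delta = sum(w * (B @ (A @ x)) for w, (A, B) in zip(weights, experts))
    return base + delta
```

Because every expert output is weighted and summed (rather than dispatched to a single expert), the layer stays differentiable end-to-end, and the added parameter count per expert is only 2·r·d instead of d².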
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.498 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 85.7 | 664 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 67.8 | 208 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 90.8 | 192 |
| Chart Question Answering | ChartQA (test) | -- | -- | 129 |
| Visual Question Answering | TextVQA (test) | Accuracy | 81.1 | 124 |
| Visual Question Answering | ScienceQA (test) | Accuracy | 94.7 | 95 |
| Information Visual Question Answering | InfoVQA (test) | ANLS | 80.3 | 92 |
| Visual Question Answering | OCR-VQA (test) | Accuracy | 75.7 | 77 |
| Visual Question Answering | VQAv2 (test-dev) | Accuracy | 85 | 76 |