CROME: Cross-Modal Adapters for Efficient Multimodal LLM
About
Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation. Existing approaches often require expensive language model retraining and offer limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning. We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations before they are fed into a frozen LLM. This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it enables fine-tuning with exceptional parameter efficiency, competing with task-specific state-of-the-art specialist methods. CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.
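The abstract describes the adapter only at a high level. Below is a minimal PyTorch sketch of one plausible gated cross-modal adapter of this kind, not CROME's actual implementation: text tokens cross-attend to projected visual features, and a zero-initialized gate controls how much visual signal is mixed in before the fused sequence enters the frozen LLM. All names and dimensions here (`GatedCrossModalAdapter`, `text_dim=4096`, `vision_dim=1024`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossModalAdapter(nn.Module):
    """Sketch of a gated cross-modal adapter (hypothetical, not CROME's code).

    Text token embeddings attend to visual features via cross-attention;
    a learned gate controls how much visual signal is injected before the
    fused tokens are passed to a frozen LLM.
    """

    def __init__(self, text_dim: int, vision_dim: int, num_heads: int = 8):
        super().__init__()
        # Project visual features into the text embedding space.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)
        # Zero-initialized gate: training starts from the frozen LLM's
        # text-only behavior and gradually admits visual information.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (batch, seq_len, text_dim)
        # vision_feats: (batch, num_patches, vision_dim)
        v = self.vision_proj(vision_feats)
        attended, _ = self.cross_attn(query=self.norm(text_tokens), key=v, value=v)
        # Gated residual fusion; the result is the input to the frozen LLM.
        return text_tokens + torch.tanh(self.gate) * attended


# Usage: only the adapter's parameters are trained; the LLM stays frozen.
adapter = GatedCrossModalAdapter(text_dim=4096, vision_dim=1024)
text = torch.randn(2, 32, 4096)    # embedded instruction tokens (assumed shapes)
image = torch.randn(2, 256, 1024)  # vision-encoder patch features
fused = adapter(text, image)       # (2, 32, 4096), fed to the frozen LLM
```

The zero-initialized gate is a common design choice for adapters attached to frozen backbones: at initialization the module is an identity on the text stream, so training cannot destabilize the pretrained LLM and only a small number of adapter parameters receive gradients.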
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Evaluation | MME | -- | 557 |
| Multimodal Understanding | MMBench | -- | 367 |
| Multimodal Understanding | MMMU | Accuracy: 41.2 | 275 |
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Visual Question Answering | AI2D | Accuracy: 75.3 | 174 |
| Multimodal Evaluation | MM-Vet | Accuracy: 55.1 | 122 |
| Hallucination Evaluation | HallusionBench | Average Score: 51.3 | 93 |
| Visual Question Answering | ScienceQA Image (test) | Accuracy: 93.2 | 45 |
| Visual Instruction Following | LLaVA-Bench Wild | -- | 35 |