CROME: Cross-Modal Adapters for Efficient Multimodal LLM
About
Multimodal Large Language Models (MLLMs) demonstrate remarkable image-language capabilities, but their widespread use faces challenges in cost-effective training and adaptation. Existing approaches often require expensive language model retraining and offer limited adaptability. Additionally, the current focus on zero-shot performance improvements offers insufficient guidance for task-specific tuning. We propose CROME, an efficient vision-language instruction tuning framework. It features a novel gated cross-modal adapter that effectively combines visual and textual representations before they are fed into a frozen LLM. This lightweight adapter, trained with minimal parameters, enables efficient cross-modal understanding. Notably, CROME demonstrates superior zero-shot performance on standard visual question answering and instruction-following benchmarks. Moreover, it enables fine-tuning with exceptional parameter efficiency, competing with task-specific state-of-the-art specialist methods. CROME demonstrates the potential of pre-LM alignment for building scalable, adaptable, and parameter-efficient multimodal models.
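The abstract describes the adapter only at a high level. Below is a minimal PyTorch sketch of one plausible gated cross-modal adapter of this kind, not CROME's actual implementation: text tokens cross-attend to projected visual features, and a zero-initialized gate controls how much visual signal is mixed in before the fused sequence enters the frozen LLM. All names and dimensions here (`GatedCrossModalAdapter`, `text_dim=4096`, `vision_dim=1024`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossModalAdapter(nn.Module):
    """Sketch of a gated cross-modal adapter (hypothetical, not CROME's code).

    Text token embeddings attend to visual features via cross-attention;
    a learned gate controls how much visual signal is injected before the
    fused tokens are passed to a frozen LLM.
    """

    def __init__(self, text_dim: int, vision_dim: int, num_heads: int = 8):
        super().__init__()
        # Project visual features into the text embedding space.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)
        # Zero-initialized gate: training starts from the frozen LLM's
        # text-only behavior and gradually admits visual information.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (batch, seq_len, text_dim)
        # vision_feats: (batch, num_patches, vision_dim)
        v = self.vision_proj(vision_feats)
        attended, _ = self.cross_attn(query=self.norm(text_tokens), key=v, value=v)
        # Gated residual fusion; the result is the input to the frozen LLM.
        return text_tokens + torch.tanh(self.gate) * attended


# Usage: only the adapter's parameters are trained; the LLM stays frozen.
adapter = GatedCrossModalAdapter(text_dim=4096, vision_dim=1024)
text = torch.randn(2, 32, 4096)    # embedded instruction tokens (assumed shapes)
image = torch.randn(2, 256, 1024)  # vision-encoder patch features
fused = adapter(text, image)       # (2, 32, 4096), fed to the frozen LLM
```

The zero-initialized gate is a common design choice for adapters attached to frozen backbones: at initialization the module is an identity on the text stream, so training cannot destabilize the pretrained LLM and only a small number of adapter parameters receive gradients.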
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Evaluation | MME | -- | 557 |
| Multimodal Understanding | MMBench | -- | 367 |
| Multimodal Understanding | MMMU | Accuracy: 41.2 | 275 |
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Visual Question Answering | AI2D | Accuracy: 75.3 | 174 |
| Multimodal Evaluation | MM-Vet | Accuracy: 55.1 | 122 |
| Hallucination Evaluation | HallusionBench | Average Score: 51.3 | 93 |
| Visual Question Answering | ScienceQA Image (test) | Accuracy: 93.2 | 45 |
| Visual Instruction Following | LLaVA-Bench Wild | -- | 35 |