Modular and Parameter-Efficient Multimodal Fusion with Prompting

About

Recent research has made impressive progress in large-scale multimodal pre-training. In the context of the rapid growth of model size, it is necessary to seek efficient and flexible methods other than finetuning. In this paper, we propose to use prompt vectors to align the modalities. Our method achieves comparable performance to several other multimodal fusion methods in low-resource settings. We further show that our method is modular and parameter-efficient for processing tasks involving two or more data modalities.
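
The abstract does not spell out the architecture, so the following is only a minimal sketch of one common way prompt vectors can be used for fusion: a small set of trainable vectors is prepended to the features produced by frozen encoders, and only those vectors would be updated during training. All names (`make_prompts`, `fuse_with_prompts`, `trainable_fraction`) and all sizes are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import random


def make_prompts(num_prompts, dim, seed=0):
    """Create small random prompt vectors (the only trainable parameters in this sketch)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]


def fuse_with_prompts(prompts, image_features, text_embeddings):
    """Build one input sequence: [prompts; image features; text embeddings].

    Assumption: both modalities have already been mapped to the same
    dimensionality by frozen encoders.
    """
    dims = {len(vec) for vec in prompts + image_features + text_embeddings}
    if len(dims) != 1:
        raise ValueError("all vectors must share one dimensionality")
    return prompts + image_features + text_embeddings


def trainable_fraction(num_prompts, dim, frozen_params):
    """Share of all parameters that is trainable when only the prompts are tuned."""
    trainable = num_prompts * dim
    return trainable / (trainable + frozen_params)


# Hypothetical toy sizes: 4 prompts, 2 image features, 3 text tokens, width 8.
prompts = make_prompts(num_prompts=4, dim=8)
image_features = [[0.0] * 8 for _ in range(2)]
text_embeddings = [[1.0] * 8 for _ in range(3)]
fused = fuse_with_prompts(prompts, image_features, text_embeddings)

# Hypothetical scale: 20 prompts of width 768 next to 100M frozen parameters.
fraction = trainable_fraction(num_prompts=20, dim=768, frozen_params=100_000_000)
```

In this sketch, adding a further modality only means appending another list of features to the sequence, which is one way to read the modularity claim; the parameter count shows why tuning prompts alone is far cheaper than finetuning the whole model.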

Sheng Liang, Mengjie Zhao, Hinrich Schütze • 2022

Related benchmarks

Task	Dataset	Metric	Result
Multimodal Multilabel Classification	MM-IMDB (test)	Macro F1	50.18	104
Multimodal Classification	UPMC Food-101 (test)	Accuracy	84.56	28
Multimodal Classification	SNLI-VE (test)	Accuracy	65.54	22
Procedural Multimedia Reasoning	PMR (test)	Accuracy	76.5	15
Procedural Multimedia Reasoning	PMR (val)	Accuracy	77.4	8
