ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt

About

Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by learning from mixed instruction datasets. However, novel tasks are encountered sequentially in a dynamic world, which calls for equipping LMMs with multimodal continual instruction learning (MCIT) ability, especially for diverse and challenging generative tasks. Existing MCIT methods do not fully exploit the unique attributes of LMMs and often gain performance at the expense of efficiency. In this paper, we propose a novel prompt learning framework for MCIT that effectively alleviates forgetting of previous knowledge while managing computational complexity with natural image-text supervision. Concretely, we learn prompts for each task, and use efficient prompt fusion for knowledge transfer and prompt selection for complexity management, both with dual-modality guidance. Extensive experiments demonstrate that our approach achieves a substantial +14.26% performance gain on MCIT benchmarks with a 1.42$\times$ inference speedup, free from growing computation. Code is available at https://github.com/AuroraZengfh/ModalPrompt.
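The prompt selection and fusion steps described above can be illustrated with a small sketch. This is not the authors' implementation: the per-task prototype vectors, the cosine-similarity scoring, the summing of image and text scores, the top-k rule, and the function names `select_prompts` and `fuse_prompts` are all assumptions made for illustration. It only shows how guidance from both modalities could pick a fixed number of task prompts, so that the prompt length stays constant as more tasks are learned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_prompts(image_feat, text_feat, prototypes, k=2):
    """Return indices of the k tasks whose prototype best matches the input.

    Assumption: each task has one prototype vector, and a task's score is the
    sum of its similarity to the image feature and to the text feature.
    """
    scores = [cosine(image_feat, p) + cosine(text_feat, p) for p in prototypes]
    ranked = sorted(range(len(prototypes)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def fuse_prompts(prompts, selected):
    """Concatenate the prompt tokens of the selected tasks (hypothetical fusion rule)."""
    fused = []
    for i in selected:
        fused.extend(prompts[i])
    return fused


# Toy example: three tasks, two prompt tokens each, 2-d features.
prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompts = [["p0a", "p0b"], ["p1a", "p1b"], ["p2a", "p2b"]]

selected = select_prompts([1.0, 0.1], [0.9, 0.0], prototypes, k=2)
fused = fuse_prompts(prompts, selected)
```

Because `k` is fixed, the fused prompt in this sketch has the same length regardless of how many tasks have been seen, which is one way to read the "free from growing computation" claim.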

Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, Cheng-Lin Liu • 2024

Related benchmarks

Task	Dataset	Result
Multimodal Continual Instruction Tuning	UCIT (Unified Continual Instruction Tuning)	Average Score (UCIT): 67.93	40
Continual Instruction Tuning	UCIT	Image-R Score: 80.5	30
Continual Instruction Tuning	MLLM-DCL	RS Score: 58.19	20
Continual Learning	MLLM-CL	RS Last Score: 65.99	18
Image Captioning	ToS-TextCaps	BLEU-4 Average: 11.7	18
Visual Question Answering	ToS-TextVQA	Accuracy (Avg): 4.06	18
Continual Learning	Evidence-sensitive stream	Average Score: 56.99	16
Multimodal Continual Instruction Tuning	COIN	--	13
Multimodal Continual Instruction Tuning	TriGap v1 (test)	PMCVQA Score: 38.23	10
Multimodal Continual Instruction Tuning	TriGap	PMCVQA: 38.23	10

Showing 10 of 11 rows
