HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

About

Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, equipping MLLMs with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to evaluate the performance of different methods more fairly. Comprehensive experiments show a significant performance improvement of our method over existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.
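
For readers unfamiliar with CKA, the following is a minimal sketch of the linear variant of Centered Kernel Alignment between two sets of layer activations. It is not taken from the HiDe-LLaVA code: the function name `linear_cka`, the choice of the linear kernel, and the random example data are assumptions made purely for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices x (n, p1) and y (n, p2).

    Rows are the same n examples passed through two layers (or two models).
    Hypothetical helper for illustration, not the paper's implementation.
    """
    # Center each feature (column) so the Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts_a = rng.normal(size=(200, 16))  # stand-in for one layer's activations
    acts_b = rng.normal(size=(200, 16))  # stand-in for an unrelated layer
    print("same representation:", linear_cka(acts_a, acts_a))
    print("unrelated representations:", linear_cka(acts_a, acts_b))
```

A value near 1 indicates that two layers encode the same examples in a similar way (up to rotation and isotropic scaling), while a value near 0 indicates dissimilar representations. Comparing such scores layer by layer across models tuned on different datasets is the kind of analysis the abstract refers to.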

Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu • 2025

Related benchmarks

Task	Dataset	Result
Continual Instruction Tuning	UCIT	Image-R Score 89.33	30
Multimodal Continual Instruction Tuning	UCIT (Unified Continual Instruction Tuning)	ImgNet-R Score 87.62	28
Continual Visual Question Answering	VQA v2 (test)	Rec. Accuracy 49.27	23
Continual Instruction Tuning	MLLM-DCL	RS Score 77.73	20
Continual Learning	MLLM-CL Ability	OCR Score 24.6	17
Domain-incremental learning	MLLM-CL Domain	RS Score 74.8	17
Continual Image Editing	CIE-Bench Avg	ERP Score 8.0622	14
Continual Learning	MLLM-CL (test)	RS Score 74.3	13
Continual Image Editing	CIE-Bench Last	ERP Score 8.4194	12
Continual Learning	UCIT (Avg)	ImageNet-R Accuracy 85.7	12

Showing 10 of 22 rows
