
Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

About

Catastrophic forgetting emerges as a critical challenge when fine-tuning multi-modal large language models (MLLMs): improving performance on unseen tasks often leads to a significant performance drop on the original tasks. This paper presents a comprehensive analysis of catastrophic forgetting in MLLMs and introduces a post-training adjustment method called Model Tailor. Our method primarily preserves the pre-trained parameters while replacing a small number (≤10%) of fine-tuned parameters, maintaining ~99% effectiveness on original tasks versus pre-training, and achieving ~97% on new tasks compared to standard fine-tuning. Specifically, we derive a sparse mask to identify the "model patch", based on a fusion strategy that integrates salience and sensitivity analysis. Subsequently, a compensation mechanism is introduced to "decorate the patch", enhancing the model's performance on both target and original tasks. Additionally, our method is adaptable to multi-task scenarios. Through extensive experiments on InstructBLIP and LLaVA-1.5 in both image captioning and visual question answering tasks, our approach demonstrates significant task adaptability while preserving inherent pre-trained capabilities.
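The core idea — keep the pre-trained weights and graft in only a small, carefully chosen subset of fine-tuned weights — can be sketched as follows. Note this is a minimal illustration, not the paper's actual algorithm: it uses only the magnitude of the fine-tuning shift as a salience proxy, whereas Model Tailor fuses salience with a sensitivity analysis and adds a compensation step; the function name `model_patch` and the 10% ratio are illustrative choices.

```python
import numpy as np

def model_patch(theta_pre, theta_ft, ratio=0.10):
    """Graft the top `ratio` most salient fine-tuned parameters onto
    the pre-trained parameters; everything else reverts to pre-trained.

    Salience here is approximated by |theta_ft - theta_pre| only
    (the paper additionally uses a sensitivity term and compensation).
    """
    delta = np.abs(theta_ft - theta_pre)
    k = max(1, int(ratio * delta.size))
    # Indices of the k largest fine-tuning shifts ("the patch").
    idx = np.argpartition(delta.ravel(), -k)[-k:]
    mask = np.zeros(delta.size, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(delta.shape)
    # Patched model: fine-tuned values where the mask is set,
    # pre-trained values everywhere else.
    merged = np.where(mask, theta_ft, theta_pre)
    return merged, mask
```

Because ~90% of the parameters are restored to their pre-trained values, the merged model stays close to the original model's behavior while the sparse patch carries the new-task adaptation.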

Didi Zhu, Zhongyi Sun, Zexi Li, Tao Shen, Ke Yan, Shouhong Ding, Kun Kuang, Chao Wu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | VizWiz | Accuracy 44.87 | 1043 |
| Visual Question Answering | GQA | Accuracy 37.35 | 963 |
| Text-based Visual Question Answering | TextVQA | Accuracy 47.36 | 496 |
| Visual Question Answering | GQA | Accuracy 51.48 | 374 |
| Science Question Answering | ScienceQA | Accuracy 74.9 | 229 |
| Image Classification | ImageNet | Accuracy 78.22 | 47 |
| Multimodal Question Answering | ScienceQA | Accuracy 61.58 | 35 |
| Image Captioning | COCO Caption | OKVQA 51.83 | 22 |
| Continual Multimodal Instruction Tuning | CoIN (ScienceQA, TextVQA, ImageNet, GQA, VizWiz, Grounding; Chameleon backbone) | Accuracy 32.62 | 22 |
| Referring Expression Grounding | RefCOCO, RefCOCO+, RefCOCOg | Accuracy 36.72 | 10 |
Showing 10 of 23 rows
