LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

About

The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language. However, the increasing model size and computational complexity of large-scale MLLMs (l-MLLMs) limit their use in resource-constrained scenarios. Although small-scale MLLMs (s-MLLMs) are designed to reduce computational costs, they typically suffer from performance degradation. To mitigate this limitation, we propose a novel LLaVA-KD framework to transfer knowledge from l-MLLMs to s-MLLMs. Specifically, we introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer the teacher model's ability to capture relationships between visual tokens. Additionally, we propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy: 1) Distilled Pre-Training to strengthen the alignment between visual and linguistic representations in s-MLLMs, 2) Supervised Fine-Tuning to equip the s-MLLMs with multimodal understanding capacity, and 3) Distilled Fine-Tuning to refine the s-MLLM's knowledge. Our approach significantly improves s-MLLMs' performance without altering the model architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available at https://github.com/Fantasyele/LLaVA-KD.
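The two distillation objectives can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes MDist is a temperature-softened KL divergence between teacher and student output distributions (the standard logit-distillation form), and that RDist matches pairwise cosine-similarity matrices over visual tokens; the function names and the `temperature` parameter are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mdist_loss(student_logits, teacher_logits, temperature=2.0):
    """Multimodal distillation sketch: KL(teacher || student) on
    temperature-softened distributions, averaged over all token positions
    (applied to both visual and linguistic tokens)."""
    t = temperature
    p = softmax(teacher_logits / t)
    q = softmax(student_logits / t)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl) * t * t)  # t^2 rescales gradients, as in logit KD

def rdist_loss(student_visual, teacher_visual):
    """Relation distillation sketch: MSE between the teacher's and student's
    pairwise cosine-similarity matrices over visual tokens, so the student
    mimics the teacher's token-to-token relationships."""
    def relation(tokens):  # tokens: (batch, n_tokens, dim)
        norm = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-12)
        return norm @ np.swapaxes(norm, -1, -2)  # (batch, n_tokens, n_tokens)
    return float(np.mean((relation(student_visual) - relation(teacher_visual)) ** 2))
```

In a training loop these terms would be weighted and added to the usual supervised language-modeling loss; the stage schedule (Distilled Pre-Training, Supervised Fine-Tuning, Distilled Fine-Tuning) then controls which parameters are trained and which losses are active.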

Yuxuan Cai, Jiangning Zhang, Haoyang He, Xinwei He, Ao Tong, Zhenye Gan, Chengjie Wang, Zhucun Xue, Yong Liu, Xiang Bai• 2024

Related benchmarks

Task | Dataset | Result | Rank
Object Hallucination Evaluation | POPE | - | 1455
Visual Question Answering | GQA | Accuracy: 62.3 | 1249
Text-based Visual Question Answering | TextVQA | Accuracy: 53.4 | 807
Multimodal Evaluation | MME | Score: 69.1 | 658
Multimodal Understanding | MMBench | Accuracy: 64 | 637
Science Question Answering | ScienceQA (SQA) | Accuracy: 64.7 | 273
Speech Emotion Recognition | RAVDESS | Unweighted Accuracy: 89.36 | 43
Speech Emotion Recognition | SAVEE | WA: 87.5 | 23
Visual Question Answering | General VQA (VQAv2, VizWiz, GQA, TextVQA, MME) | GQA Accuracy: 62.3 | 23
Compositional Reasoning | Compositional Reasoning Suite (Aggregated) | SugarCrepe Score: 75.3 | 23

Showing 10 of 13 rows.
