
CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs

About

Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a way to address the high computational complexity of full-scale MLLMs, making them more practical for real-world applications. In this regard, knowledge distillation (KD), which transfers the rich visual and linguistic knowledge of a larger model (teacher) to a smaller model (student), has emerged as a promising approach. However, we observe that existing KD methods struggle to effectively distill the teacher MLLM's rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between the student and the teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student's visual attention with that of the teacher to enhance the student's visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities, while maintaining strong performance on the visual question answering tasks evaluated in existing studies. Furthermore, CompoDistill remains effective with a more advanced backbone, highlighting its generalizability.
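The abstract describes aligning the student's visual attention with the teacher's, but no implementation is given here. As an illustration only, below is a minimal sketch of what a generic attention-alignment distillation loss could look like in PyTorch. The function name, the KL-divergence formulation, the head-averaging step, and the loss weighting are all assumptions for the sake of the example, not the authors' actual CompoDistill objective.

```python
import torch
import torch.nn.functional as F


def attention_alignment_loss(student_attn: torch.Tensor,
                             teacher_attn: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical sketch of attention distillation, NOT the paper's method.

    Both inputs are assumed to be attention maps of shape
    (batch, heads, query_len, key_len), e.g. the student's and teacher's
    attention from text tokens to visual tokens.
    """
    # Average over heads so that mismatched head counts between the
    # (smaller) student and (larger) teacher do not matter.
    s = student_attn.mean(dim=1)  # (batch, q, k)
    t = teacher_attn.mean(dim=1)  # (batch, q, k)

    # Renormalize each query's attention into a proper distribution,
    # clamping to avoid log(0).
    s = s.clamp_min(eps) / s.clamp_min(eps).sum(dim=-1, keepdim=True)
    t = t.clamp_min(eps) / t.clamp_min(eps).sum(dim=-1, keepdim=True)

    # KL(teacher || student): F.kl_div expects the first argument in
    # log-space and the second as probabilities.
    return F.kl_div(s.log(), t, reduction="batchmean")
```

In a full distillation pipeline, a term like this would typically be weighted and added to the standard task and logit-distillation losses, e.g. `total_loss = task_loss + lambda_attn * attention_alignment_loss(s_attn, t_attn)`, where `lambda_attn` is a tuning hyperparameter assumed here for illustration.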

Jiwan Kim, Kibum Kim, Sangwoo Seo, Chanyoung Park • 2025

Related benchmarks

Task | Dataset | Result | Rank
Compositional Reasoning | Compositional Reasoning Suite (Aggregated) | SugarCrepe Score: 82.9 | 23
Visual Question Answering | General VQA (VQAv2, VizWiz, GQA, TextVQA, MME) | GQA Accuracy: 62.2 | 23
Relational Hallucination Evaluation | R-Bench | F1 Score: 78.6 | 5
Relational Hallucination Evaluation | Reefknot | F1 Score: 66.7 | 5
