Distilling Textual Priors from LLM to Efficient Image Fusion

About

Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead in both memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework uses a teacher-student architecture, in which the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model's ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing SOTA methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.

Ran Zhang, Xuanhua He, Ke Cao, Liu Liu, Li Zhang, Man Zhou, Jie Zhang • 2025
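
The abstract describes a teacher-student setup in which a small student network learns to reproduce what a larger, text-guided teacher produces. The abstract does not specify the "tailored distillation process", so the following is only a generic sketch of a distillation objective, not the authors' loss. It assumes the student is penalized for deviating from the teacher's fused output and from one intermediate feature vector; the function names `l1_distance` and `distillation_loss` and the `feat_weight` parameter are hypothetical.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length, non-empty vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    student_feat: Sequence[float],
    teacher_feat: Sequence[float],
    feat_weight: float = 0.5,  # hypothetical weighting, not taken from the paper
) -> float:
    """Generic distillation objective (an assumption, not the paper's loss).

    output term  : student's fused image vs. the teacher's fused image
    feature term : student's intermediate features vs. the teacher's
    The teacher values are treated as fixed targets.
    """
    output_term = l1_distance(student_out, teacher_out)
    feature_term = l1_distance(student_feat, teacher_feat)
    return output_term + feat_weight * feature_term


if __name__ == "__main__":
    # Toy flattened "images" and "features" for illustration only.
    loss = distillation_loss([1.0, 3.0], [0.0, 1.0], [2.0], [0.0])
    print(loss)
```

In a setup like this, text guidance would be needed only to produce the teacher's targets during training; at inference the student runs alone, which is consistent with the abstract's claim that text guidance is eliminated at inference time.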

Related benchmarks

Task	Dataset	Metric	Result	
Infrared and Visible Image Fusion	RoadScene	Qabf	0.639	42
Infrared-Visible Image Fusion	MSRS	QAB/F	0.732	38
Object Detection	MSRS (test)	mAP@0.5	93.68	34
Multi-Modal Image Fusion	MRI-CT (test)	EN	4.462	30
Medical image fusion	PET-MRI (test)	SSIM	1.223	14
Medical image fusion	SPECT-MRI (test)	SSIM	1.21	14
Image Fusion	MSRS (test)	VIF	1.06	13
Object Detection	MSRS	--	--	11

