Distilling Textual Priors from LLM to Efficient Image Fusion
About
Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches based on CNNs and GANs are efficient but struggle with low-quality or complex inputs. Recent text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead in both memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework uses a teacher-student architecture in which the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model's ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing SOTA methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.
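The abstract does not spell out the distillation objective, so the following is only a minimal sketch of a generic teacher-student loss, assuming the student is trained to imitate both the teacher's fused output and one intermediate feature map. The function names, the `feat_weight` parameter, and the use of mean squared error are illustrative assumptions, not the authors' tailored distillation process.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    student_feat: Sequence[float],
    teacher_feat: Sequence[float],
    feat_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Generic distillation objective (an assumption, not the paper's loss).

    The first term pulls the student's fused image toward the teacher's;
    the second pulls an intermediate student feature toward the teacher's
    text-informed feature, so no text input is needed by the student.
    """
    output_term = mse(student_out, teacher_out)
    feature_term = mse(student_feat, teacher_feat)
    return output_term + feat_weight * feature_term


# Toy flattened "images" and "features" purely for illustration.
loss = distillation_loss([1.0, 1.0], [0.0, 0.0], [2.0], [0.0])
```

In a real training loop the teacher would be frozen and would receive the text prior, while only the student's parameters are updated against a loss of this general shape.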
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Detection | MSRS (test) | mAP@0.5 | 93.68 | 34 |
| Multi-Modal Image Fusion | MRI-CT (test) | EN | 4.462 | 30 |
| Infrared and Visible Image Fusion | RoadScene | MI | 3.454 | 28 |
| Infrared-Visible Image Fusion | MSRS | Entropy (EN) | 6.749 | 23 |
| Medical image fusion | PET-MRI (test) | SSIM | 1.223 | 14 |
| Medical image fusion | SPECT-MRI (test) | SSIM | 1.21 | 14 |
| Image Fusion | MSRS (test) | VIF | 1.06 | 13 |
| Object Detection | MSRS | mAP@50 | 93.68 | 10 |