Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond
About
Multi-modality image fusion, particularly infrared and visible, plays a crucial role in integrating diverse modalities to enhance scene understanding. Although early research prioritized visual quality, preserving fine details and adapting to downstream tasks remains challenging. Recent approaches attempt task-specific design but rarely achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Enable downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via a persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allows the student network to effectively extract knowledge. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency. The code is available at https://github.com/RollingPlain/SAGE_IVIF.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | MSRS | mIoU | 73 | 68 |
| Infrared-Visible Image Fusion | RoadScene (test) | Visual Information Fidelity (VIF) | 0.53 | 53 |
| Salient Object Detection | VT5000 | -- | -- | 50 |
| Semantic segmentation | FMB | mIoU | 0.6078 | 49 |
| Visible-Infrared Image Fusion | MSRS (test) | -- | -- | 43 |
| Infrared-Visible Image Fusion | MSRS | QAB/F (edge-based fusion quality) | 0.6242 | 38 |
| Infrared-Visible Image Fusion | LLVIP (test) | EN | 6.96 | 36 |
| Object Detection | M3FD | AP@[0.5:0.95] | 62.25 | 35 |
| Infrared-Visible Image Fusion | KAIST | AG | 3.376 | 22 |
| Infrared-Visible Image Fusion | FLIR | AG | 3.254 | 22 |
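
For readers unfamiliar with the abbreviations in the table, the sketch below shows how three of the metrics are commonly defined: entropy (EN), average gradient (AG), and mean intersection over union (mIoU). It is a minimal NumPy illustration, not the evaluation code behind these leaderboards. The function names are hypothetical, 8-bit grayscale input is assumed for EN, and the forward-difference form of AG is one common convention; the benchmarks may use different variants or scaling.

```python
import numpy as np


def entropy(img):
    """Shannon entropy (EN) in bits of an 8-bit grayscale image (assumed uint8)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())


def average_gradient(img):
    """Average gradient (AG): mean of sqrt((dx^2 + dy^2) / 2) using forward differences."""
    f = img.astype(np.float64)
    dx = f[:-1, 1:] - f[:-1, :-1]  # horizontal difference
    dy = f[1:, :-1] - f[:-1, :-1]  # vertical difference
    return float(np.sqrt((dx ** 2 + dy ** 2) / 2.0).mean())


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```

Higher EN and AG are usually read as more information content and sharper detail in the fused image, while mIoU measures how well a segmentation model performs on the fused result.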