UNIV: Unified Foundation Model for Infrared and Visible Modalities
About
Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.
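The patch-level contrastive idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `patch_contrastive_loss`, the cosine-similarity threshold `pos_threshold`, the `temperature` value, and the InfoNCE-style multi-positive loss are all assumptions made for illustration. The sketch assumes each image pair has already been split into N spatially aligned patches, with D-dimensional features from the trainable model (`ir_feats`, `vis_feats`) and from a frozen pre-trained model (`teacher_feats`).

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def patch_contrastive_loss(ir_feats, vis_feats, teacher_feats,
                           pos_threshold=0.8, temperature=0.1):
    """Hypothetical patch-level cross-modal contrastive loss.

    ir_feats, vis_feats: (N, D) patch features from the trainable model.
    teacher_feats:       (N, D) patch features from a frozen model, used
                         only to decide which patch pairs count as
                         semantically related (pseudo positives).
    """
    ir = l2_normalize(ir_feats)
    vis = l2_normalize(vis_feats)
    teacher = l2_normalize(teacher_feats)

    # Pseudo patch pairs: (i, j) is positive when the frozen model
    # considers patches i and j semantically similar (assumed rule).
    pos_mask = (teacher @ teacher.T) >= pos_threshold
    # A patch is always a positive for its own aligned counterpart.
    np.fill_diagonal(pos_mask, True)

    # Infrared-to-visible similarity logits with a numerically stable
    # row-wise log-softmax.
    logits = (ir @ vis.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Attract positives, repel the rest: average the negative
    # log-probability over each anchor's positives.
    pos_count = pos_mask.sum(axis=1)
    loss_per_anchor = -(log_prob * pos_mask).sum(axis=1) / pos_count
    return float(loss_per_anchor.mean())


if __name__ == "__main__":
    teacher = np.eye(4)                      # four mutually unrelated patches
    aligned = patch_contrastive_loss(np.eye(4), np.eye(4), teacher)
    shifted = patch_contrastive_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0), teacher)
    print(aligned < shifted)
```

In this toy setup, infrared and visible features that agree patch by patch give a loss near zero, while features shifted by one patch give a much larger loss, which is the behavior a cross-modal alignment objective of this kind is meant to have.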
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | MSRS | mIoU | 79 | 93 |
| Object detection | M3FD-IR (test) | mAP | 56.9 | 11 |
| Semantic segmentation | MSRS Infrared (test) | mIoU | 76.6 | 11 |
| Semantic segmentation | SODA-IR (test) | mIoU | 69.6 | 8 |
| Semantic segmentation | MFNet-IR (val) | mIoU | 50.78 | 8 |
| Semantic segmentation | MFNet-IR (test) | mIoU | 51.06 | 8 |
| Semantic segmentation | MSRS IR | mIoU | 0.76 | 4 |
| Semantic segmentation | ADE20K RGB | mIoU | 51.2 | 3 |