# Closing the Modality Gap Aligns Group-Wise Semantics

## About
In multimodal learning, CLIP has become the *de facto* method for learning a shared latent space across modalities, pulling similar representations together and pushing dissimilar ones apart. Although CLIP-based losses align modalities effectively at the semantic level, the resulting latent spaces often remain only partially shared, a structural mismatch known as the modality gap. While the need to address this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we show that its influence is strongly pronounced in group-wise tasks (e.g., clustering). To support this claim, we introduce a novel method that consistently reduces the gap in two-modal settings and extends straightforwardly to the general *n*-modal case. Our extensive evaluation demonstrates a novel insight: while reducing the gap yields only marginal or inconsistent improvements on traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in tasks that require semantic grouping.
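The paper's own gap-reduction method is not spelled out above, but the quantity being closed is easy to make concrete. Below is a minimal NumPy sketch, not the authors' method: it measures the gap as the Euclidean distance between the two modalities' embedding centroids (a standard definition in the modality-gap literature) and shrinks it with a simple centroid-shift baseline. The function names `modality_gap` and `close_gap` and the random toy embeddings are illustrative assumptions.

```python
import numpy as np

def modality_gap(x: np.ndarray, y: np.ndarray) -> float:
    """Euclidean distance between the centroids of two
    L2-normalized embedding sets (a common gap measure)."""
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))

def close_gap(x: np.ndarray, y: np.ndarray):
    """Centroid-shift baseline: subtract each modality's mean,
    then re-project the embeddings onto the unit hypersphere."""
    def center(z: np.ndarray) -> np.ndarray:
        z = z - z.mean(axis=0, keepdims=True)
        return z / np.linalg.norm(z, axis=1, keepdims=True)
    return center(x), center(y)

# Toy usage: random unit vectors stand in for CLIP embeddings.
rng = np.random.default_rng(0)
img = rng.normal(size=(512, 64))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(512, 64)) + 0.5  # constant offset mimics the gap
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

print(f"gap before: {modality_gap(img, txt):.3f}")
img_c, txt_c = close_gap(img, txt)
print(f"gap after:  {modality_gap(img_c, txt_c):.3f}")  # near zero
```

On real CLIP embeddings the same two calls apply unchanged. Because the shift is a per-modality translation followed by renormalization, it leaves within-modality neighborhood structure almost unchanged, which is consistent with the abstract's observation that instance-wise tasks are far less sensitive to the gap than group-wise ones.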
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Cross-modal retrieval | MSR-VTT 3 modal | Gap | 7 | 7 |
| Captioning | MSCOCO 2 modal | BLEU-1 | 46.1 | 4 |
| Captioning | MSR-VTT 3 modal | BLEU@1 | 26.8 | 4 |
| Multimodal retrieval | MSCOCO 2 modal | Gap | 0.03 | 4 |
| Classification | MOSI MultiBench (test) | Gap | 0.24 | 3 |
| Clustering | OpenImage V7 (val) | V-Measure | 17.2 | 3 |
| Cross-modal retrieval | MSCOCO 2 modal | Gap | 0.03 | 3 |
| Cross-modal retrieval | AV-MNIST 3 modal | Gap | 0.09 | 3 |
| Modality gap analysis | OpenImage V7 (val) | Gap | 31.1 | 3 |
| Classification | UR-FUNNY MultiBench (test) | Gap | 0.77 | 3 |