
Closing the Modality Gap Aligns Group-Wise Semantics

About

In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general n-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.
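To make the central quantity concrete: the modality gap is commonly measured as the distance between the centroids of each modality's L2-normalized embeddings. The sketch below, in NumPy, computes that gap and also shows a naive post-hoc centering baseline that shrinks it; both are illustrative assumptions for exposition, not the metric or the method proposed in this paper.

```python
import numpy as np


def modality_gap(emb_a, emb_b):
    """Euclidean distance between the centroids of two modalities'
    L2-normalized embeddings (one common definition of the modality gap;
    an illustrative sketch, not necessarily the paper's exact metric)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


def center_modalities(emb_a, emb_b):
    """Naive baseline: subtract each modality's centroid so both embedding
    clouds share a common center, then re-normalize to the unit sphere.
    A hypothetical baseline for illustration, NOT the paper's method."""
    a = emb_a - emb_a.mean(axis=0)
    b = emb_b - emb_b.mean(axis=0)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a, b
```

On synthetic data where the two modalities occupy offset cones (as CLIP-style spaces tend to), centering drives the measured gap close to zero, which is the kind of reduction the paper argues matters for group-wise tasks such as clustering.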

Eleonora Grassucci, Giordano Cicchetti, Emanuele Frasca, Aurelio Uncini, Danilo Comminiello • 2026

Related benchmarks

| Task                  | Dataset                   | Metric    | Result | Rank |
|-----------------------|---------------------------|-----------|--------|------|
| Cross-modal retrieval | MSR-VTT 3 modal           | Gap       | 7      | 7    |
| Captioning            | MSCOCO 2 modal            | BLEU-1    | 46.1   | 4    |
| Captioning            | MSR-VTT 3 modal           | BLEU@1    | 26.8   | 4    |
| Multimodal retrieval  | MSCOCO 2 modal            | Gap       | 0.03   | 4    |
| Classification        | MOSI MultiBench (test)    | Gap       | 0.24   | 3    |
| Clustering            | OpenImage V7 (val)        | V-Measure | 17.2   | 3    |
| Cross-modal retrieval | MSCOCO 2 modal            | Gap       | 0.03   | 3    |
| Cross-modal retrieval | AV-MNIST 3 modal          | Gap       | 0.09   | 3    |
| Modality gap analysis | OpenImage V7 (val)        | Gap       | 31.1   | 3    |
| Classification        | UR-FUNNY MultiBench (test)| Gap       | 0.77   | 3    |

Showing 10 of 18 rows
