
The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

About

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework. We release our code and dataset at https://github.com/estafons/confu.
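The objective described in the abstract, pairwise contrastive terms plus a fused-modality term, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names, the `fused_weight` parameter, the use of all three pair-to-third combinations, and especially the element-wise product used as the fusion step are assumptions made for illustration only. The loss used here is a standard symmetric InfoNCE.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.1):
    """Symmetric InfoNCE: row i of `anchors` is the positive for row i of `targets`."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    n = len(a)
    # Temperature-scaled cosine similarity matrix.
    sim = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]

    def directional(s):
        # Mean cross-entropy with the diagonal entry as the correct class.
        total = 0.0
        for i in range(n):
            m = max(s[i])
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in s[i]))
            total += log_sum_exp - s[i][i]
        return total / n

    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (directional(sim) + directional(sim_t))


def fuse(z_a, z_b):
    """Placeholder fusion (element-wise product); an assumption, not the paper's module."""
    return [[x * y for x, y in zip(u, v)] for u, v in zip(z_a, z_b)]


def confu_style_loss(z1, z2, z3, fused_weight=1.0):
    """Pairwise contrastive terms plus fused-pair-to-third-modality terms."""
    pairwise = info_nce(z1, z2) + info_nce(z1, z3) + info_nce(z2, z3)
    fused = (info_nce(fuse(z1, z2), z3)
             + info_nce(fuse(z1, z3), z2)
             + info_nce(fuse(z2, z3), z1))
    return pairwise + fused_weight * fused


if __name__ == "__main__":
    # Toy batch of three samples with perfectly aligned one-hot embeddings.
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(confu_style_loss(eye, eye, eye))
```

In a real model, each `z` would be the output of a modality-specific encoder for a batch of samples, and the fusion step would be a learned module rather than a fixed product.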

Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagkatakis • 2025

Related benchmarks

Task                       Dataset       Metric                      Result  Rank
Classification             AV-MNIST      Accuracy                    71.2    24
Multimodal Classification  UR-FUNNY      Accuracy                    64.9    21
Multimodal Classification  MOSI          Accuracy                    66.7    13
Multimodal Classification  MUSTARD       Accuracy                    64.1    13
Classification             SSW60 (test)  Accuracy                    65.5    12
Classification             SSW60         Accuracy                    71.4    12
Classification             VB100 (test)  Accuracy (%)                16.7    12
Classification             VB100         Accuracy                    19.3    12
Multimodal Retrieval       MOSI          Recall@10 (Q: M23, T: M1)   16.7    4
Multimodal Retrieval       MUSTARD       Recall@10 (Q: M23, T: M1)   79.6    4
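
The retrieval rows report Recall@10. A minimal sketch of how a Recall@K score of this kind is typically computed follows. It assumes that "Q: M23, T: M1" denotes a query built by fusing modalities 2 and 3 and retrieved against modality 1, which is a reading of the notation rather than something this page states. The function names are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def recall_at_k(queries, targets, k):
    """Fraction of queries whose matching target (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, target) for target in targets]
        ranked = sorted(range(len(targets)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


if __name__ == "__main__":
    # Toy data: in a two-to-one setting, each query would be a fused embedding.
    fused_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    target_embeddings = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.9]]
    print(recall_at_k(fused_queries, target_embeddings, k=1))
```

The same function covers one-to-one retrieval when the queries are single-modality embeddings, since both kinds of query live in the same representation space.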
