Multimodal Representation Learning Conditioned on Semantic Relations
About
Multimodal representation learning has been largely driven by contrastive models such as CLIP, which learn a shared embedding space by aligning paired image-text samples. While effective for general-purpose representation learning, such models typically produce a single embedding per sample that is reused across different semantic relations and contexts. However, in many real-world applications, relevance between samples is inherently relation-dependent, with different semantic relations emphasizing different aspects of multimodal data. In this work, we propose Relation-Conditioned Multimodal Learning (RCML), a framework that treats semantic relations as explicit conditions of multimodal representation learning. Rather than producing relation-agnostic embeddings, RCML learns representations conditioned on natural-language relation descriptions, allowing the same sample to be represented differently under different relational contexts. The framework constructs relation-aware training pairs, introduces a relation-conditioned module to adapt embeddings to relation semantics, and employs a unified contrastive objective to jointly model cross-modal alignment and relation-induced inter-sample structure. Experiments on multiple datasets show that RCML consistently outperforms strong baselines on retrieval and classification tasks in zero-shot, fine-tuned, and out-of-domain settings, highlighting the effectiveness of leveraging semantic relations to guide multimodal representation learning.
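The idea of conditioning an embedding on a relation and training it with a contrastive objective can be illustrated with a small sketch. It is not the RCML implementation: the abstract does not specify the form of the relation-conditioned module, so the element-wise gating in `condition`, the one-directional InfoNCE loss in `info_nce`, and the temperature value are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def condition(embedding, relation):
    """Hypothetical relation-conditioned module: element-wise gating.

    Each dimension of the sample embedding is rescaled by (1 + r_i), where r
    is an embedding of the natural-language relation description. This is an
    assumed stand-in, not the module used in the paper.
    """
    gated = [e * (1.0 + r) for e, r in zip(embedding, relation)]
    return l2_normalize(gated)


def info_nce(queries, targets, temperature=0.07):
    """One-directional InfoNCE loss: targets[i] is the positive for queries[i],
    and every other target in the batch serves as a negative."""
    losses = []
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, target)) / temperature
            for target in targets
        ]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# The same sample receives different embeddings under two (made-up) relations.
sample = [1.0, 1.0]
under_relation_a = condition(sample, [1.0, 0.0])
under_relation_b = condition(sample, [0.0, 1.0])
```

In this toy setup, `under_relation_a` emphasizes the first dimension and `under_relation_b` the second, which mirrors the claim that one sample can be represented differently under different relational contexts. Relation-aware training pairs would then be passed to `info_nce` after conditioning both sides on the same relation.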
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Relation-Conditioned Retrieval | Elec | Hit@5 49.32 | 30 |
| Relation-Conditioned Retrieval | Auto | Hit@5 44.38 | 30 |
| Relation-Conditioned Retrieval | OFFICE | Hit@5 49.64 | 30 |
| Relation-Conditioned Retrieval | Baby | Hit@5 44.65 | 30 |
| Relation-Conditioned Retrieval | Pet | Hit@5 55.08 | 30 |
| Relation-Conditioned Retrieval | music | Hit@5 51.53 | 30 |
| Relation-Conditioned Retrieval | Sports | Hit@5 64.81 | 30 |
| Relation-Conditioned Retrieval | Goodread | Hit@5 53.26 | 30 |
| Relation-Conditioned Retrieval | Elec | MRR 32.95 | 30 |
| Relation-Conditioned Retrieval | Auto | MRR 30.34 | 30 |
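
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how Hit@k and MRR are commonly computed for retrieval. The function names and the toy data are hypothetical, and the paper's exact evaluation protocol (candidate pool, tie handling, scaling to percentages) is not stated here, so this follows the standard definitions only.

```python
def hit_at_k(ranked_ids, relevant_ids, k=5):
    """Return 1.0 if any relevant item appears in the top-k results, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / position
    return 0.0


def evaluate(rankings, relevants, k=5):
    """Average both metrics over all queries; values are fractions in [0, 1]."""
    count = len(rankings)
    hit = sum(hit_at_k(r, rel, k) for r, rel in zip(rankings, relevants)) / count
    mrr = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)) / count
    return {"hit@k": hit, "mrr": mrr}


# Toy example: the first query finds its target at rank 2, the second misses.
scores = evaluate(
    rankings=[["a", "b", "c"], ["x", "y", "z"]],
    relevants=[{"b"}, {"q"}],
)
```

Here `scores` holds fractions; the values in the table above appear to be on a 0 to 100 scale, which would correspond to multiplying these fractions by 100.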