Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
About
Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
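The idea of modality-specific codebooks that agree at the index level can be illustrated with a toy sketch. This is not CoDAAR's implementation, and it does not reproduce DTA or CSA. It only shows nearest-codeword quantization with one codebook per modality, plus a simple measure of how often the two modalities select the same index at a given time step. The codebooks, feature values, and the function names `quantize` and `index_agreement` are hypothetical and chosen for illustration.

```python
import random


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codeword (squared L2)."""
    indices = []
    for x in features:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def index_agreement(idx_a, idx_b):
    """Fraction of time steps at which two modalities select the same codebook index."""
    matches = sum(1 for i, j in zip(idx_a, idx_b) if i == j)
    return matches / len(idx_a)


# Hypothetical toy setup: each modality keeps its own codebook in its own
# feature space, but codeword k is meant to carry the same meaning in both.
audio_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_codebook = [[10.0, 10.0], [12.0, 10.0], [10.0, 12.0], [12.0, 12.0]]

# A paired sequence of 6 time steps sharing one underlying index sequence.
shared_indices = [0, 1, 1, 3, 2, 0]
rng = random.Random(0)

# Noise stays below half the codeword spacing, so quantization recovers the index.
audio_feats = [[v + rng.uniform(-0.2, 0.2) for v in audio_codebook[k]]
               for k in shared_indices]
video_feats = [[v + rng.uniform(-0.4, 0.4) for v in video_codebook[k]]
               for k in shared_indices]

audio_idx = quantize(audio_feats, audio_codebook)
video_idx = quantize(video_feats, video_codebook)
agreement = index_agreement(audio_idx, video_idx)
```

In this sketch the two codebooks live in different feature spaces, so modality-specific structure is kept, while the shared index sequence is what a downstream model trained on one modality could consume from the other. How the codebooks are learned to reach such agreement is the part the abstract attributes to DTA and CSA, and it is not modeled here.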
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Event Classification (V → A) | VGGSound-AVEL 40K | Precision | 73.5 | 15 |
| Event Classification (A → V) | VGGSound-AVEL 90K | Precision | 65 | 15 |
| Event Classification (V → A) | VGGSound-AVEL 90K | Precision | 50.8 | 11 |
| Event Localization (A → V) | VGGSound-AVEL 40K | Segment-level Accuracy | 72.5 | 11 |
| Event Localization (A → V) | VGGSound-AVEL 90K | Segment-level Accuracy | 70.4 | 11 |
| Event Localization (V → A) | VGGSound-AVEL 40K | Segment-level Accuracy | 70.8 | 11 |
| Event Localization (V → A) | VGGSound-AVEL 90K | Segment-level Accuracy | 69.7 | 11 |
| Event Classification (A → V) | VGGSound-AVEL 40K | Precision | 55.5 | 11 |
| Cross-modal retrieval | Clotho A↔T zero-shot | Recall@1 | 5.64 | 6 |
| Cross-modal retrieval | MSCOCO V↔T zero-shot | R@1 | 1.3 | 6 |