
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

About

Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
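The abstract does not spell out how index-level alignment is computed, so the following is only a toy illustration of the general idea, not CoDAAR's DTA or CSA mechanisms. It assumes plain nearest-neighbour vector quantization, one codebook per modality, and the convention that entry k in each codebook stands for the same semantic concept. All names (`quantize`, `index_agreement`, the codebooks and the synthetic features) are hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector (T, D) to the index of its nearest code (K, D)."""
    # Squared Euclidean distance between every feature and every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def index_agreement(idx_a, idx_b):
    """Fraction of time steps at which two modalities pick the same code index."""
    return float(np.mean(np.asarray(idx_a) == np.asarray(idx_b)))

rng = np.random.default_rng(0)
K, D, T = 8, 4, 16  # codebook size, feature dimension, sequence length (assumed)

# Separate codebooks: each modality keeps its own code vectors (its own geometry).
audio_codebook = rng.normal(size=(K, D))
video_codebook = rng.normal(size=(K, D))

# Synthetic paired sequence: both streams follow the same hidden concept per step.
concepts = rng.integers(0, K, size=T)
audio_feats = audio_codebook[concepts] + 0.01 * rng.normal(size=(T, D))
video_feats = video_codebook[concepts] + 0.01 * rng.normal(size=(T, D))

audio_idx = quantize(audio_feats, audio_codebook)
video_idx = quantize(video_feats, video_codebook)

# The code vectors differ across modalities, but the selected indices can still agree.
agreement = index_agreement(audio_idx, video_idx)
print(f"index-level agreement: {agreement:.2f}")
```

In this sketch the two codebooks hold unrelated vectors, yet a downstream model that consumes only the indices sees the same token sequence for either modality. That is one way to read "semantic consensus across modality-specific codebooks"; the actual training objective is described in the paper, not here.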

Souptik Sen, Raneen Younis, Zahra Ahmadi • 2026

Related benchmarks

Task                          Dataset               Metric                  Result  Rank
Event Classification (V → A)  VGGSound-AVEL 40K     Precision               73.5    15
Event Classification (A → V)  VGGSound-AVEL 90K     Precision               65      15
Event Classification (V → A)  VGGSound-AVEL 90K     Precision               50.8    11
Event Localization (A → V)    VGGSound-AVEL 40K     Segment-level Accuracy  72.5    11
Event Localization (A → V)    VGGSound-AVEL 90K     Segment-level Accuracy  70.4    11
Event Localization (V → A)    VGGSound-AVEL 40K     Segment-level Accuracy  70.8    11
Event Localization (V → A)    VGGSound-AVEL 90K     Segment-level Accuracy  69.7    11
Event Classification (A → V)  VGGSound-AVEL 40K     Precision               55.5    11
Cross-modal retrieval         Clotho A↔T zero-shot  Recall@1                5.64    6
Cross-modal retrieval         MSCOCO V↔T zero-shot  R@1                     1.3     6
Showing 10 of 20 rows
