Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

About

Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
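The idea of modality-specific codebooks that agree at the index level can be illustrated with a toy example. The sketch below is not the CoDAAR implementation, and it does not implement DTA or CSA. It only assumes plain nearest-neighbour vector quantization, and the codebooks, feature values, and function names (quantize, index_agreement) are hypothetical. Each modality keeps its own codebook in its own feature space, and agreement is measured on the selected indices rather than on the vectors themselves.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codeword."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def index_agreement(idx_a: List[int], idx_b: List[int]) -> float:
    """Fraction of time steps at which both modalities pick the same index."""
    if len(idx_a) != len(idx_b):
        raise ValueError("index sequences must have the same length")
    if not idx_a:
        return 0.0
    return sum(i == j for i, j in zip(idx_a, idx_b)) / len(idx_a)


# Hypothetical toy data: each modality has its own codebook and its own
# feature dimensionality (2-D "audio", 3-D "video"); only indices are shared.
audio_codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
video_codebook = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 2.0, 0.0)]

# Paired sequences: one feature vector per time step in each modality.
audio_features = [(0.1, -0.1), (0.9, 1.2), (1.8, 0.1), (1.1, 0.9)]
video_features = [(0.0, 0.1, 0.0), (0.9, 0.1, 1.1), (0.1, 1.9, 0.0), (0.1, 0.0, 0.2)]

audio_idx = quantize(audio_features, audio_codebook)
video_idx = quantize(video_features, video_codebook)
agreement = index_agreement(audio_idx, video_idx)

print(audio_idx, video_idx, agreement)
```

In this toy setup a training objective would presumably push the agreement score toward 1 on paired data; how CoDAAR actually achieves that consensus is described in the paper, not here.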

Souptik Sen, Raneen Younis, Zahra Ahmadi • 2026

Related benchmarks

Task	Dataset	Result
Event Classification (V → A)	VGGSound-AVEL 40K	Precision 73.5	15
Event Classification (A → V)	VGGSound-AVEL 90K	Precision 65	15
Event Classification (V → A)	VGGSound-AVEL 90K	Precision 50.8	11
Event Localization (A → V)	VGGSound-AVEL 40K	Segment-level Accuracy 72.5	11
Event Localization (A → V)	VGGSound-AVEL 90K	Segment-level Accuracy 70.4	11
Event Localization (V → A)	VGGSound-AVEL 40K	Segment-level Accuracy 70.8	11
Event Localization (V → A)	VGGSound-AVEL 90K	Segment-level Accuracy 69.7	11
Event Classification (A → V)	VGGSound-AVEL 40K	Precision 55.5	11
Cross-modal retrieval	Clotho A↔T zero-shot	Recall@15.64	6
Cross-modal retrieval	MSCOCO V↔T zero-shot	R@11.3	6

Showing 10 of 20 rows
