EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

About

Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image-text), leaving unpaired modality pairs (e.g., audio↔depth, infrared↔audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose EmergentBridge, an embedding-level bridging framework that improves performance on these unpaired pairs without requiring exhaustive pairwise supervision. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce gradient interference, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a noisy bridge anchor (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.
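Step (ii) of the abstract restricts proxy alignment to the subspace orthogonal to the anchor-alignment direction. The abstract does not say how this is implemented, so the following is only a minimal sketch of the generic vector operation such a constraint usually relies on: subtracting from one vector its component along another. The names `project_orthogonal`, `proxy_update`, and `anchor_dir` are hypothetical, and whether the authors apply the projection to gradients, embeddings, or something else is an assumption left open here.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(
    proxy_update: Sequence[float],
    anchor_dir: Sequence[float],
    eps: float = 1e-12,
) -> List[float]:
    """Return proxy_update with its component along anchor_dir removed.

    Hypothetical helper: illustrates orthogonal projection in general,
    not the paper's actual training procedure.
    """
    denom = dot(anchor_dir, anchor_dir)
    if denom < eps:
        # Anchor direction is (numerically) zero: nothing to remove.
        return list(proxy_update)
    coeff = dot(proxy_update, anchor_dir) / denom
    return [p - coeff * a for p, a in zip(proxy_update, anchor_dir)]


# Toy vectors (illustrative values, not taken from the paper).
anchor_dir = [1.0, 0.0, 0.0]
proxy_update = [0.5, 2.0, -1.0]
projected = project_orthogonal(proxy_update, anchor_dir)
```

In this toy case the projected vector keeps the two coordinates that do not touch the anchor direction and drops the one that does, so applying it cannot move anything along `anchor_dir`. That is the sense in which an orthogonality constraint can leave the anchor alignment intact while still allowing the proxy term to act.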

Jincheng Xie, Xingchen Xiao, Runheng Liu, Zhongyi Huang, Yu Zheng, Heyan Huang • 2026

Related benchmarks

Task                           Dataset    Metric          Result  Rank
Audio Classification           ESC-50     Accuracy        92      374
Audio Classification           AudioSet   mAP             28.1    54
Audio Retrieval                AudioCaps  R@1             11.8    50
Audio Retrieval                Clotho     R@1             11.7    28
Infrared Image Classification  LLVIP      Top-1 Accuracy  85.1    18
Depth Image Classification     NYU-D      Top-1 Acc       70.1    17
Audio Classification           VGG-S      Top-1 Accuracy  36.3    8
Depth classification           SUN        Top-1 Accuracy  40.3    7
RGB-to-X retrieval             AVE        R@1             37      4
RGB-to-X retrieval             VGG-S      R@1             30.1    4
Showing 10 of 12 rows
