
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

About

Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches align images and text only at a coarse level and face two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, both of which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we introduce a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels and exhibit a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.
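The cross-modal consistency idea described above — replacing hard one-hot contrastive targets with targets softened by semantic similarities from an auxiliary task — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the blending weight `alpha`, and the temperature `tau` are all assumptions.

```python
import math

def softmax(xs, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def soft_alignment_loss(sim, aux_sim, tau=0.05, alpha=0.2):
    """Cross-modal alignment loss with softened negatives (illustrative).

    sim:     N x N image-text similarity matrix (row i = image i vs. all texts).
    aux_sim: N x N semantic similarities from an auxiliary task; used to soften
             the one-hot targets so that semantically close "negatives" are
             penalised less than unrelated ones.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        log_p = [math.log(p) for p in softmax(sim[i], tau)]
        soft = softmax(aux_sim[i], tau)
        # Blend the hard one-hot target with the auxiliary distribution.
        target = [(1 - alpha) * (1.0 if j == i else 0.0) + alpha * soft[j]
                  for j in range(n)]
        # Soft-target cross-entropy for row i.
        total += -sum(t * lp for t, lp in zip(target, log_p))
    return total / n
```

With `alpha = 0`, this reduces to the standard InfoNCE-style objective; increasing `alpha` shifts probability mass toward negatives that the auxiliary task judges semantically similar, which is one plausible reading of "softening negative samples".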

Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji • 2025

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Image Retrieval | Flickr30k (test) | Recall@1: 26.7 | 423
Image-to-Text Retrieval | Flickr30k (test) | Recall@1: 19.5 | 370
Semantic Textual Similarity | STS tasks (STS12, STS13, STS14, STS15, STS16, STS-B, SICK-R) | STS12 score: 73.9 | 195
Transfer Learning | SentEval Transfer Learning Tasks (test) | MR: 83.57 | 52
Sentence Embedding Evaluation | MTEB (test) | Re-Rank score: 48.35 | 48
