
Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

About

Recent studies have shown that dual encoder models trained with the sentence-level translation ranking task are effective for cross-lingual sentence embedding. However, our research indicates that token-level alignment is also crucial in multilingual scenarios, and it has not been fully explored previously. Based on these findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, in which the model learns to use one-side contextualized token representations to reconstruct their translation counterparts. This reconstruction objective encourages the model to embed translation information into the token representations. Compared with other token-level alignment methods such as translation language modeling, RTL is more suitable for dual encoder architectures and is computationally efficient. Extensive experiments on three sentence-level cross-lingual benchmarks demonstrate that our approach significantly improves sentence embedding. Our code is available at https://github.com/ChillingDream/DAP.
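To make the two objectives concrete, below is a minimal PyTorch sketch (not the authors' code; see the repository above for the real implementation). It assumes a symmetric in-batch InfoNCE loss for the sentence-level translation ranking task, and a hypothetical RTL head consisting of a single cross-attention layer plus a vocabulary projection; the paper's actual head architecture, query construction, and loss weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAlignmentLoss(nn.Module):
    """Illustrative sketch of DAP's two training signals:
    (1) sentence-level translation ranking: contrastive loss over pooled
        sentence embeddings of parallel pairs, using in-batch negatives;
    (2) token-level RTL: reconstruct the target-side tokens from the
        source-side contextualized token representations.
    """

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 8):
        super().__init__()
        # Hypothetical RTL head: one cross-attention layer + LM projection.
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads,
                                                batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def ranking_loss(self, src_emb, tgt_emb, temperature: float = 0.05):
        # src_emb, tgt_emb: (batch, hidden) pooled sentence embeddings
        # of aligned translation pairs (row i of each is a pair).
        src_emb = F.normalize(src_emb, dim=-1)
        tgt_emb = F.normalize(tgt_emb, dim=-1)
        logits = src_emb @ tgt_emb.t() / temperature  # (batch, batch)
        labels = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: each sentence must rank its own translation
        # above all other sentences in the batch, in both directions.
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))

    def rtl_loss(self, src_tokens, tgt_queries, tgt_ids):
        # src_tokens:  (batch, src_len, hidden) contextualized source tokens;
        # tgt_queries: (batch, tgt_len, hidden) queries for target positions
        #              (e.g. position embeddings; an assumption here);
        # tgt_ids:     (batch, tgt_len) gold target token ids.
        recon, _ = self.cross_attn(tgt_queries, src_tokens, src_tokens)
        logits = self.lm_head(recon)  # predict each target token
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tgt_ids.reshape(-1))
```

In a full pre-training setup the two terms would be combined as a weighted sum (e.g. ranking_loss + lambda * rtl_loss); the weighting and the depth of the RTL head are design choices the paper tunes.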

Ziheng Li, Shaohan Huang, Zihan Zhang, Zhi-Hong Deng, Qiang Lou, Haizhen Huang, Jian Jiao, Furu Wei, Weiwei Deng, Qi Zhang • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Bitext Mining | FLORES-200 (34 languages) | xsim | 2.9 | 4
Bitext Mining (with hard negatives) | FLORES-200 (34 languages) | xsim++ | 32.82 | 4
Bitext Mining | BUCC (4 languages) | F1 | 98.68 | 4
Semantic Textual Similarity | MTEB (test) | Average STS Score | 59.39 | 4
Multilingual Classification | MTEB | Average Accuracy | 61.8 | 4
Pair Classification | MTEB (test) | Average AP | 66.01 | 4
Single Sentence Classification | SentEval | Accuracy | 78.18 | 4
