
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

About

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive with the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
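The multilingual similarity search evaluation mentioned above works by embedding aligned source and target sentences into the shared space and checking whether each source sentence retrieves its gold translation as nearest neighbour. Below is a minimal pure-Python sketch of such a retrieval error rate using plain cosine similarity; note this is illustrative only — the actual xSIM metric typically uses a margin-based criterion, xSIM++ adds hard negatives, and the function names and toy 2-D vectors here are hypothetical (real SONAR embeddings are high-dimensional fixed-size vectors).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def xsim_error_rate(src_embs, tgt_embs):
    """Fraction of source sentences whose nearest target embedding
    is NOT the gold-aligned one at the same index (lower is better)."""
    errors = 0
    for i, src in enumerate(src_embs):
        best = max(range(len(tgt_embs)), key=lambda j: cosine(src, tgt_embs[j]))
        if best != i:
            errors += 1
    return errors / len(src_embs)

# Toy example: 3 aligned sentence pairs in a 2-D "embedding space".
src = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
tgt = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
print(xsim_error_rate(src, tgt))  # 0.0 — every source retrieves its pair
```

With well-aligned embeddings the error rate is 0.0; the xSIM/xSIM++ numbers in the benchmark table below are error rates of this kind, so lower is better.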

Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Cross-modal and cross-lingual retrieval | FLEURS (test) | Avg xSIM++ (26) | 14.3 | 7
Machine Translation | FLORES X→Eng (devtest) | COMET Score (Low) | 0.851 | 6
Cross-lingual similarity search | FLORES X→Eng (devtest) | xSIM++ Low | 13.1 | 5
Multilingual Classification | MTEB | Average Accuracy | 63.02 | 4
Pair Classification | MTEB (test) | Average AP | 69.7 | 4
Single Sentence Classification | SentEval | Accuracy | 85.82 | 4
Bitext Mining | FLORES-200 (34 languages) | d-xsim | 0.04 | 4
Bitext Mining (with hard negatives) | FLORES-200 (34 languages) | d-xsim++ | 10.55 | 4
Bitext Mining | BUCC (4 languages) | BUCC F1 | 98.25 | 4
Semantic Textual Similarity | MTEB (test) | Average STS Score | 58.04 | 4
Showing 10 of 14 rows
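The speech results above rely on the teacher-student training described in the abstract: the text encoder acts as a frozen teacher, and each language-specific speech encoder (student) is trained on transcription pairs to map speech to the same point in the embedding space as the transcription's text embedding. A minimal sketch of that objective as a mean-squared-error loss — the 4-dimensional vectors and function name are toy illustrations, not the actual SONAR training code:

```python
def mse_loss(student_emb, teacher_emb):
    """Mean squared error between a student (speech) embedding and the
    frozen teacher (text) embedding of the matching transcription."""
    assert len(student_emb) == len(teacher_emb)
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / len(student_emb)

# Hypothetical fixed-size embeddings (real SONAR vectors are far larger).
teacher = [0.2, -0.5, 0.1, 0.9]   # text encoder output (frozen target)
student = [0.1, -0.4, 0.2, 0.8]   # speech encoder output being trained
print(round(mse_loss(student, teacher), 3))  # 0.01
```

Minimizing this loss pulls speech embeddings onto their text counterparts, which is what makes the zero-shot speech-to-text translation possible: the text decoder never needs to see speech during training.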
