# SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

## About
We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xSIM and xSIM++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR space using language-specific speech encoders trained in a teacher-student setting on speech transcription data; these encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, enabling text-to-text and speech-to-text machine translation, including zero-shot language and modality combinations. Our text-to-text results are competitive with the state-of-the-art NLLB 1B model despite the fixed-size bottleneck representation, and our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
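A fixed-size embedding space makes cross-lingual similarity search a simple nearest-neighbor lookup over sentence vectors, which is what xSIM-style evaluations measure. Below is a minimal, hypothetical sketch of that idea using synthetic NumPy vectors as stand-ins for real SONAR encoder outputs; the function name and toy data are our own, not part of the SONAR codebase.

```python
import numpy as np

def xsim_error_rate(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source sentences whose cosine-nearest target is NOT
    the aligned translation at the same index (xSIM-style error rate)."""
    # Normalize rows so that a plain dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                 # (n_src, n_tgt) cosine matrix
    nearest = sims.argmax(axis=1)      # best-matching target per source
    gold = np.arange(len(src))         # aligned bitext: i-th src <-> i-th tgt
    return float((nearest != gold).mean())

# Toy aligned "bitext": targets are noisy copies of sources,
# so nearest-neighbor retrieval should recover the alignment.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 16))
tgt = src + 0.01 * rng.normal(size=src.shape)
print(xsim_error_rate(src, tgt))  # → 0.0
```

Lower is better: 0.0 means every source sentence retrieved its own translation, mirroring how the xSIM and xSIM++ scores in the table below are read.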
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Cross-modal and cross-lingual retrieval | FLEURS (test) | Avg xSIM++ (26) | 14.3 | 7 |
| Machine Translation | FLORES X→Eng (devtest) | COMET Score (Low) | 0.851 | 6 |
| Cross-lingual similarity search | FLORES X→Eng (devtest) | xSIM++ (Low) | 13.1 | 5 |
| Multilingual Classification | MTEB | Average Accuracy | 63.02 | 4 |
| Pair Classification | MTEB (test) | Average AP | 69.7 | 4 |
| Single Sentence Classification | SentEval | Accuracy | 85.82 | 4 |
| Bitext Mining | FLORES-200 (34 languages) | d-xsim | 0.04 | 4 |
| Bitext Mining (with hard negatives) | FLORES-200 (34 languages) | d-xsim++ | 10.55 | 4 |
| Bitext Mining | BUCC (4 languages) | BUCC F1 | 98.25 | 4 |
| Semantic Textual Similarity | MTEB (test) | Average STS Score | 58.04 | 4 |
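The bitext-mining rows above (FLORES-200, BUCC) are commonly evaluated with margin-based scoring rather than raw cosine similarity, following the ratio-margin criterion of Artetxe & Schwenk (2019) used in the LASER line of work. The sketch below is our own illustrative implementation of that criterion on synthetic vectors, not code from SONAR's evaluation pipeline.

```python
import numpy as np

def margin_scores(src: np.ndarray, tgt: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin score: cosine(x, y) divided by the average cosine of
    each side's k nearest neighbors, which penalizes 'hub' sentences
    that are close to everything."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T
    # Mean similarity of each row (source) / column (target) to its k nearest.
    knn_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    knn_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return sims / (0.5 * (knn_src + knn_tgt))

# Toy aligned data: mining by argmax over margin scores should
# recover the diagonal (i-th source pairs with i-th target).
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 16))
tgt = src + 0.01 * rng.normal(size=src.shape)
mined = margin_scores(src, tgt).argmax(axis=1)
```

In a real mining setup, candidate pairs are additionally filtered by a threshold on the margin score before computing precision/recall metrics such as the BUCC F1 reported above.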