SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

About

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive compared to the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.

Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot • 2023
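
The similarity search tasks mentioned in the abstract can be illustrated with a small sketch. It assumes that aligned sentence pairs share a row index in two embedding matrices and scores candidates with plain cosine similarity; the published xsim evaluation may use a different scoring function (for example a margin-based one). The function name `xsim_error_rate` and the toy data are hypothetical and do not come from the SONAR codebase.

```python
import numpy as np


def xsim_error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source rows whose nearest target row (by cosine
    similarity) is not the row with the same index.

    Assumption: row i of src_emb and row i of tgt_emb embed translations
    of the same sentence. Hypothetical helper, not the official metric code.
    """
    # L2-normalise so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                 # shape (n_src, n_tgt)
    nearest = sims.argmax(axis=1)      # best target index per source row
    gold = np.arange(src.shape[0])     # aligned index for each source row
    return float(np.mean(nearest != gold))


if __name__ == "__main__":
    # Toy data: "translations" are the same vectors plus a little noise.
    rng = np.random.default_rng(0)
    tgt = rng.normal(size=(5, 8))
    src = tgt + 0.01 * rng.normal(size=(5, 8))
    print(xsim_error_rate(src, tgt))
```

A lower error rate means that more sentences retrieve their true translation as the nearest neighbour, which is the sense in which one embedding space can be said to outperform another on such a task.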

Related benchmarks

Task	Dataset	Result
Multimodal Sentiment Analysis	CMU-MOSI	--	166
Speech Emotion Recognition	RAVDESS	Unweighted Accuracy 10.8	43
Speech Emotion Recognition	MELD	--	24
Gender Bias Sensitivity Evaluation	MuST-SHE	PA (Spanish) 53.1	17
Prosody Sensitivity Evaluation	ContraProST	PA (German) 50	17
Speech Translation Quality Estimation	IWSLT (dev)	Segment Tau_b (de) 17.3	17
Depression Detection	DAIC-WOZ	Weighted F1-score 64.3	8
Speech Emotion Recognition	IEMOCAP 4	Weighted F1-score 59.4	8
Speech Emotion Recognition	IEMOCAP-6	Weighted F1 43.5	8
Cross-modal and cross-lingual retrieval	FLEURS (test)	Avg xSIM++ (26) 14.3	7
