SENSE models: an open source solution for multilingual and multimodal semantic-based tasks

About

This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI's SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.
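The utterance-level teacher-student alignment described above can be illustrated with a toy sketch. This is not the SENSE or SpeechBrain implementation: mean pooling, the cosine-distance loss, and every name below (`mean_pool`, `alignment_loss`, the toy vectors) are assumptions made for illustration, since the abstract does not specify the pooling method or the loss.

```python
import math

def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector (assumed pooling)."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(student_frames, teacher_embedding):
    """Cosine distance between the pooled student (speech) output and the
    teacher (text) embedding. In a real setup the teacher would be frozen
    and only the student would receive gradients; this sketch only
    computes the scalar loss (assumed loss, not confirmed by the source)."""
    student_embedding = mean_pool(student_frames)
    return 1.0 - cosine_similarity(student_embedding, teacher_embedding)

# Toy example: two speech frames, two-dimensional embeddings.
student_frames = [[1.0, 0.0], [0.0, 1.0]]
aligned_teacher = [1.0, 1.0]      # same direction as the pooled frames
orthogonal_teacher = [1.0, -1.0]  # orthogonal to the pooled frames

loss_aligned = alignment_loss(student_frames, aligned_teacher)
loss_orthogonal = alignment_loss(student_frames, orthogonal_teacher)
```

Under these assumptions the loss is near zero when the pooled speech embedding points in the same direction as the text embedding and grows as the two diverge, which is the sense in which the speech encoder is pulled toward the text encoder's representation space.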

Salima Mdhaffar, Haroun Elleuch, Chaimae Chellaf, Ha Nguyen, Yannick Estève • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | Loquacious (dev) | WER 17.67 | 18 |
| Automatic Speech Recognition | Loquacious (test) | WER 18.42 | 17 |
| Speech Translation | FLEURS X→En (test) | -- | 12 |
| Speech-to-speech translation retrieval (EN to Y) | VoxPopuli | EN->FR Retrieval Score 96.54 | 3 |
| Speech-to-speech translation retrieval (X to EN) | VoxPopuli | FR-EN Performance 96.55 | 3 |
| Speech-to-speech translation retrieval (X to Y) | VoxPopuli | FR to DE Retrieval Performance 95.2 | 3 |
| Speech-to-text translation retrieval | MTEDx | Retrieval Score (IT->EN) 90.69 | 3 |
| Speech-to-text translation retrieval | Fleurs | NY-CES Retrieval Score 27 | 3 |
