
mSLAM: Massively multilingual joint pre-training for speech and text

About

We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID while being competitive on multilingual ASR, when compared against speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.
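The CTC losses on paired speech and transcripts are what tie the two modalities together: they force frame-level speech representations to align with character sequences in the shared space. As a rough illustration only (not the paper's implementation), here is a minimal pure-Python sketch of the CTC negative log-likelihood via the forward algorithm; the function name and the list-based inputs are our own, and a real system would use a batched, GPU-backed implementation such as `torch.nn.CTCLoss`.

```python
import math

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss computed with the forward (alpha) recursion.

    log_probs: list of T lists, each a log-probability distribution
               over the vocabulary (index `blank` is the CTC blank).
    target:    list of label indices (no blanks).
    Returns -log P(target | log_probs), summed over all alignments.
    """
    # Extended target: a blank interleaved around every label,
    # e.g. [a, b] -> [_, a, _, b, _].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s] = log prob of all alignment prefixes ending in
    # ext[s] after the current frame.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            paths = [alpha[s]]           # stay on the same symbol
            if s > 0:
                paths.append(alpha[s - 1])  # advance one symbol
            # Skip over a blank when consecutive labels differ.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                paths.append(alpha[s - 2])
            new[s] = logsumexp(*paths) + log_probs[t][ext[s]]
        alpha = new

    # A complete alignment ends on the last label or the final blank.
    tail = alpha[S - 2] if S > 1 else NEG_INF
    return -logsumexp(alpha[S - 1], tail)
```

As a sanity check: with two frames that each assign probability 0.5 to blank and 0.5 to label 1, the valid alignments for target `[1]` (blank-then-1, 1-then-blank, 1-then-1) sum to 0.75, so the function returns -log 0.75 ≈ 0.288.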

Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau • 2022

Related benchmarks

Task | Dataset | Metric | Result | Rank
Speech Translation | CoVoST-2 (test) | Avg BLEU (15 Dir) | 24.8 | 46
Speech Recognition | VoxPopuli (test) | WER | 9.1 | 37
Speech-to-text Translation | CoVoST-2 low-resource X-to-En (test) | BLEU (Avg) | 18.5 | 24
Speech-to-text Translation | CoVoST-2 high-resource X-to-En (test) | -- | -- | 8
Speech-to-English Translation | CoVoST2 Mid X-en (test) | BLEU | 29.6 | 5
Speech-to-English Translation | CoVoST2 All X-en (test) | BLEU | 24.8 | 5
Speech Recognition | Multilingual LibriSpeech (MLS) (test) | -- | -- | 4
Language Identification | Fleurs | Accuracy | 77.7 | 3
