# Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
## About
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that, despite using a labeled training set only one-seventh the size of the one used for the Whisper model, our model achieves comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
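The random-projection quantization used during pre-training turns continuous speech features into discrete pseudo-labels that the encoder is trained to predict (as in BEST-RQ). The following is a minimal NumPy sketch of the idea, not the paper's implementation: a *frozen* random projection matrix and a frozen random codebook, with each frame mapped to its nearest codebook entry after L2-normalization. All dimensions (`FEAT_DIM`, `PROJ_DIM`, `CODEBOOK_SIZE`) are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
FEAT_DIM, PROJ_DIM, CODEBOOK_SIZE = 80, 16, 256

# Both the projection and the codebook are randomly initialized and frozen:
# nothing here is learned, which is the point of random-projection quantization.
projection = rng.standard_normal((FEAT_DIM, PROJ_DIM))
codebook = rng.standard_normal((CODEBOOK_SIZE, PROJ_DIM))


def quantize(features: np.ndarray) -> np.ndarray:
    """Map each frame (row) of `features` to the index of its nearest codebook entry."""
    projected = features @ projection  # (frames, PROJ_DIM)
    # L2-normalize both sides so nearest-neighbour search is on the unit sphere.
    projected = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    codes = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    # Pairwise distances: (frames, CODEBOOK_SIZE) -> argmin gives discrete labels.
    dists = np.linalg.norm(projected[:, None, :] - codes[None, :, :], axis=-1)
    return dists.argmin(axis=-1)


# Example: 100 frames of 80-dim features -> 100 discrete pseudo-labels.
labels = quantize(rng.standard_normal((100, FEAT_DIM)))
print(labels.shape)  # (100,)
```

These pseudo-labels then serve as prediction targets for masked frames during BERT-style self-supervised pre-training of the encoder.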
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 5.2 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 2.7 | 833 |
| Speech Translation | CoVoST-2 (test) | Avg BLEU (15 directions) | 30.7 | 46 |
| Automatic Speech Recognition | AISHELL-1 (test) | CER (offline, rescoring) | 5.31 | 7 |
| Automatic Speech Recognition | English Hardcase (test) | F1 score | 63.3 | 7 |
| Four-way emotion classification | IEMOCAP (leave-one-session-out five-fold cross-validation) | Accuracy | 71.06 | 5 |
| Automatic Speech Recognition | English Multi-domain (val) | WER | 9.33 | 4 |
| Automatic Speech Recognition | MLS | WER (ES) | 4.2 | 4 |
| Automatic Speech Recognition | English Multi-accent (evaluation set) | WER | 22.19 | 4 |
| Automatic Speech Recognition | Multilingual Multi-domain (evaluation set) | WER | 21.51 | 3 |
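Most of the ASR results above are reported as word error rate (WER): the word-level edit distance (substitutions + deletions + insertions) between a hypothesis and the reference transcript, divided by the number of reference words. A minimal reference implementation, for illustration only (the benchmarks above use their own scoring pipelines with task-specific text normalization):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein dynamic program over word sequences.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions to match an empty hypothesis
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions to match an empty reference
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # vs. del / ins
    return d[len(r)][len(h)] / len(r)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
score = wer("the cat sat on the mat", "the cat sit on mat")
print(round(100 * score, 1))  # 33.3
```

Character error rate (CER), used for the AISHELL-1 Mandarin result, is the same computation over characters instead of words, which avoids the ambiguity of word segmentation in unspaced scripts.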