Anatomy of Industrial Scale Multilingual ASR
About
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 3.1 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 1.6 | 833 |
| Automatic Speech Recognition | Fleurs | -- | 56 |
| Automated Speech Recognition | TED-LIUM V3 | WER 7.4 | 26 |
| Automatic Speech Recognition | English Hardcase (test) | F1 Score 77.82 | 7 |
| Automatic Speech Recognition | MLS | WER (ES) 3.3 | 4 |
| Automatic Speech Recognition | English Multi-accent (evaluation set) | WER 14.4 | 4 |
| Automatic Speech Recognition | English Multi-domain (val) | WER 9.95 | 4 |
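
Most results above are word error rates. As a reminder of what that metric measures, here is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It assumes plain whitespace tokenization and applies no text normalization; the function name `word_error_rate` is illustrative, and the paper's own scoring pipeline may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    no casing, punctuation, or number normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one reason hallucinated output on non-speech audio is penalized heavily by this metric.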