Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
About
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the-art accuracy can be reached without relying on web-scale data. Canary, a multilingual ASR and speech translation model, outperforms the current state-of-the-art models Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German, while being trained on an order of magnitude less data than these models. Three key factors enable such a data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture, (2) training on synthetic data generated with machine translation, and (3) advanced training techniques: data balancing, dynamic data blending, dynamic bucketing, and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
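Of the training techniques listed above, dynamic bucketing is the most mechanical: utterances are grouped by duration so each batch contains similar-length audio, and the batch size adapts to the bucket so the total audio per batch stays roughly constant. The sketch below illustrates the general idea only; all names, boundaries, and thresholds are assumptions for illustration, not details from the paper.

```python
import random

def make_buckets(durations, boundaries):
    """Assign each utterance index to the first bucket whose upper
    duration boundary (in seconds) it fits under."""
    buckets = [[] for _ in boundaries]
    for idx, dur in enumerate(durations):
        for b, upper in enumerate(boundaries):
            if dur <= upper:
                buckets[b].append(idx)
                break
    return buckets

def dynamic_batches(durations, boundaries, max_batch_seconds=60.0, seed=0):
    """Yield batches of utterance indices. The batch size per bucket is
    chosen so the summed audio duration stays under max_batch_seconds:
    short-utterance buckets get large batches, long ones get small batches."""
    rng = random.Random(seed)
    for upper, bucket in zip(boundaries, make_buckets(durations, boundaries)):
        rng.shuffle(bucket)  # shuffle within a bucket for training variety
        batch_size = max(1, int(max_batch_seconds // upper))
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]

# Example: utterances of 1-20 s bucketed at 5/10/20 s boundaries.
durs = [1.5, 4.0, 6.2, 9.8, 12.0, 19.5, 3.3, 7.7]
batches = list(dynamic_batches(durs, boundaries=[5.0, 10.0, 20.0]))
```

Because every utterance in a bucket is no longer than the bucket's upper boundary, each batch's total duration is bounded by `batch_size * upper <= max_batch_seconds`, which keeps padding waste and per-step memory roughly uniform across batches.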
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech Other | WER | 2.93 | 75 |
| Automatic Speech Recognition | LibriSpeech Clean | WER | 1.48 | 57 |
| Spoken Intelligence Evaluation | LLM_Voice 1.0 (test) | Remembering Score | 55.9 | 13 |
| Automatic Speech Recognition | SPGI Speech | WER | 2.06 | 13 |
| Automatic Speech Recognition | GigaSpeech | WER | 10.12 | 13 |