mHuBERT-147: A Compact Multilingual HuBERT Model
About
We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model, trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy that leverages both language and dataset diversity. After 3 training iterations, our compact 95M-parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10 min and 1 h leaderboards, respectively, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and is strongly competitive with the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
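The abstract does not give the up-sampling formula. As a rough illustration only, a two-level scheme of this kind can be sketched as follows: a language is drawn with probability proportional to its share of hours raised to an exponent, and a dataset within that language is drawn the same way. The function name `upsampling_probs`, the exponents `alpha` and `beta`, and their default values are assumptions made for this sketch, not settings confirmed by the text above.

```python
from collections import defaultdict


def upsampling_probs(hours, alpha=0.7, beta=0.9):
    """Return a sampling probability for each (language, dataset) pair.

    hours: dict mapping (language, dataset) -> hours of audio.
    alpha, beta: hypothetical exponents in (0, 1]; values below 1 flatten
    the distribution, so low-resource languages and small datasets are
    sampled more often than their raw share of hours would suggest.
    """
    # Total hours per language.
    lang_hours = defaultdict(float)
    for (lang, _), h in hours.items():
        lang_hours[lang] += h
    total = sum(lang_hours.values())

    # Language level: weight = (share of all hours) ** alpha.
    lang_weight = {l: (h / total) ** alpha for l, h in lang_hours.items()}
    lang_norm = sum(lang_weight.values())

    # Dataset level: weight = (share of the language's hours) ** beta.
    ds_norm = defaultdict(float)
    for (lang, _), h in hours.items():
        ds_norm[lang] += (h / lang_hours[lang]) ** beta

    probs = {}
    for (lang, ds), h in hours.items():
        p_lang = lang_weight[lang] / lang_norm
        p_ds = (h / lang_hours[lang]) ** beta / ds_norm[lang]
        probs[(lang, ds)] = p_lang * p_ds
    return probs


# Toy corpus (made-up numbers): English dominates, Swahili is tiny.
toy_hours = {("en", "A"): 900.0, ("en", "B"): 100.0, ("sw", "C"): 10.0}
toy_probs = upsampling_probs(toy_hours)
```

With both exponents set to 1, the result reduces to sampling in proportion to raw hours. With smaller exponents, the Swahili entry in the toy corpus receives a larger share than its 10 of 1010 hours.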
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | Fleurs | WER | 15.53 | 56 |
| Acoustic Discriminability (ABX) | 5 Languages (sw, ta, th, tr, uk) (dev) | Triphone ABX (WS) | 7.37 | 22 |
| Acoustic Discriminability (ABX) | Zero Resource Speech Challenge (en, fr, zh, de, wo) 2017 | ABX Triphone 1s (WS) | 6.93 | 22 |
| Automatic Speech Recognition | Kathbath Tamil | WER | 31.82 | 20 |
| Speech Recognition | Common Voice | -- | -- | 17 |
| Automatic Speech Recognition | MLC-SLM (dev) | WER/CER | 22.5 | 15 |
| Automatic Speech Recognition | Common Voice Spanish (test) | WER | 27.38 | 12 |
| Automatic Speech Recognition | Common Voice Mandarin (test) | CER | 19.82 | 12 |
| Automatic Speech Recognition | SBCSAE Large (test) | WER | 0.6835 | 12 |
| Automatic Speech Recognition | Kathbath Hindi (test) | WER | 17.55 | 12 |