ML-SUPERB: Multilingual Speech Universal PERformance Benchmark
About
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. However, SUPERB largely considers English speech in its evaluation. This paper presents multilingual SUPERB (ML-SUPERB), covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification. Following the concept of SUPERB, ML-SUPERB utilizes frozen SSL features and employs a simple framework for multilingual tasks by learning a shallow downstream model. Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features. Furthermore, we find that multilingual models do not always perform better than their monolingual counterparts. We will release ML-SUPERB as a challenge with organized datasets and reproducible training scripts for future multilingual representation research.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | ML-SUPERB 10-min Normal | CER | 29 | 26 |
| Language Identification | ML-SUPERB 10-min Normal | LID Accuracy | 89.1 | 18 |
| Automatic Speech Recognition | 10-min ML-SUPERB Few-shots | ASR CER | 39 | 12 |
| Language Identification | ML-SUPERB 1hr Normal | Accuracy | 90.9 | 10 |
| Automatic Speech Recognition | ML-SUPERB 1hr Normal | CER | 22.7 | 10 |
| Speaker Verification | VoxCeleb 10min context Normal | EER | 1.29 | 10 |
| Speaker Verification | VoxCeleb 1hr context Normal | EER | 0.0129 | 10 |
| Speaker Verification | VoxCeleb | EER | 1.29 | 8 |
| Automatic Speech Recognition | ML-SUPERB 10-min Few-shots 1.0 | ASR CER | 39 | 4 |
| Language Identification | ML-SUPERB 10-min Few-shots | LID Acc | 83.9 | 4 |
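
The table reports character error rate (CER) for speech recognition and accuracy for language identification. The snippet below is a minimal sketch of how these two metrics are commonly computed, assuming corpus-level CER (total character edits divided by total reference characters) and plain label matching for accuracy. It is not the benchmark's official scoring script, and the function names `edit_distance`, `cer`, and `lid_accuracy` are hypothetical, chosen for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference symbols
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER in percent (assumed definition, see lead-in)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars


def lid_accuracy(true_langs, predicted_langs):
    """Percentage of utterances whose predicted language label is correct."""
    correct = sum(t == p for t, p in zip(true_langs, predicted_langs))
    return 100.0 * correct / len(true_langs)


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(cer(["hello"], ["hallo"]))  # 20.0
    # Three of four language labels match.
    print(lid_accuracy(["eng", "fra", "deu", "spa"],
                       ["eng", "fra", "deu", "ita"]))  # 75.0
```

Lower is better for CER, and higher is better for accuracy. Real evaluations usually normalize the text (casing, punctuation, whitespace) before scoring. That step is omitted here.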