ClearerVoice-Studio: Bridging Advanced Speech Processing Research and Practical Deployment
About
This paper introduces ClearerVoice-Studio, an open-source, AI-powered speech processing toolkit designed to bridge cutting-edge research and practical application. Unlike broad platforms such as SpeechBrain and ESPnet, ClearerVoice-Studio focuses on the interconnected tasks of speech enhancement, speech separation, speech super-resolution, and multimodal target speaker extraction. A key advantage is its state-of-the-art pretrained models, including FRCRN with 3 million uses and MossFormer with 2.5 million uses, optimized for real-world scenarios. It also offers model optimization tools, multi-format audio support, the SpeechScore evaluation toolkit, and user-friendly interfaces, catering to researchers, developers, and end-users. Its rapid adoption, with 3000 GitHub stars and 239 forks, highlights its academic and industrial impact. This paper details ClearerVoice-Studio's capabilities, architectures, training strategies, benchmarks, community impact, and future plans. Source code is available at https://github.com/modelscope/ClearerVoice-Studio.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio-visual speech separation | LRS3 (test) | SDRi | 18.1 | 29 |
| Audio-visual speech separation | LRS2 (test) | SDRi | 15.5 | 23 |
| Audio-visual target speaker extraction | LRS2 2-mix (test) | DNSMOS | 2.64 | 22 |
| Audio-visual speech separation | LRS2 | Parameters (M) | 55.87 | 18 |
| Audio-visual speech separation | VoxCeleb2 (test) | SI-SNRi | 14 | 16 |
| Visual-prompted audio separation | Speaker | IB Score | 0.2 | 5 |
| Audio-visual speech separation | VoxCeleb2 Scenario 2: 1 speaker + music noise from FMA | SI-SNRi | 6.68 | 3 |
| Audio-visual speech separation | VoxCeleb2 Scenario 4: 1 speaker + music noise + 2 interfering speakers | SI-SNRi | 3.9 | 3 |
| Audio-visual speech separation | VoxCeleb2 Scenario 3: 1 speaker + environmental noise + 2 interfering speakers (test) | SI-SNRi | 3.33 | 3 |
| Audio-visual speech separation | VoxCeleb2 with FSD50K environmental noise Scenario 1 (test) | SI-SNRi | 8.33 | 3 |
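
Several rows above report SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated output over the unprocessed mixture, in dB. The sketch below follows the commonly used definition of SI-SNR. It is not taken from ClearerVoice-Studio or SpeechScore, and the function names (`si_snr`, `si_snr_improvement`) and the toy sine-wave signals are illustrative assumptions. It is written in plain Python for self-containment; real evaluations would normally use array libraries on actual audio.

```python
import math
from typing import List, Sequence


def _zero_mean(x: Sequence[float]) -> List[float]:
    """Remove the DC offset so the measure ignores constant bias."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def si_snr(estimate: Sequence[float], target: Sequence[float],
           eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB (common definition; hypothetical helper)."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    est = _zero_mean(estimate)
    ref = _zero_mean(target)
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    s_target = [scale * r for r in ref]
    # Whatever is left after removing the target component counts as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    target_energy = sum(s * s for s in s_target)
    noise_energy = sum(n * n for n in e_noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate: Sequence[float], mixture: Sequence[float],
                       target: Sequence[float]) -> float:
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy example: a target tone plus an orthogonal interfering tone.
N = 800
target = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 13 * n / N) for n in range(N)]
mixture = [t + i for t, i in zip(target, interferer)]
# Pretend a separator attenuated the interferer to 10% of its amplitude.
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]

improvement = si_snr_improvement(estimate, mixture, target)
print(f"SI-SNRi: {improvement:.2f} dB")
```

In this toy case the interferer amplitude drops by a factor of ten, so the improvement comes out at about 20 dB. Because the estimate is projected onto the reference, rescaling the estimate leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.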