Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
About
Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR significantly outperforms prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust in-the-wild ASR.
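For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how word error rate (WER) and a relative error reduction are commonly computed. It is not taken from the paper: the function names `wer` and `relative_reduction` are hypothetical, the WER here uses plain whitespace tokenization with no text normalization, and the usage lines assume that the quoted pairs are error rates in percent, with lower being better and the first number in each pair belonging to Mega-ASR.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion of a reference word
                curr[j - 1] + 1,                      # insertion of a hypothesis word
                prev[j - 1] + (ref_word != hyp_word), # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline error rate that the system removes."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline


# Two deleted words out of six reference words.
example_wer = wer("the cat sat on the mat", "the cat on mat")

# Assumption: the abstract's pairs are (Mega-ASR, prior best) error rates in percent.
voices_gain = relative_reduction(baseline=54.01, system=45.69)
noizeus_gain = relative_reduction(baseline=29.34, system=21.49)
```

Under those assumptions, the two quoted pairs correspond to relative reductions of roughly 15% and 27%. The "over 30%" figure in the abstract refers to the compositional scenarios, which are evaluated separately.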
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | VoxPopuli | -- | -- | 38 |
| Automatic Speech Recognition | CommonVoice | CER (English) | 7.57 | 19 |
| Automatic Speech Recognition | CHiME-4 | WER (Real Condition) | 4.38 | 13 |
| Automatic Speech Recognition | NOIZEUS | Error Rate (0dB SNR) | 19.8 | 13 |
| Automatic Speech Recognition | Fleurs | WER (en) | 3.17 | 13 |
| Automatic Speech Recognition | VOiCES | RM | 12.36 | 13 |
| Automatic Speech Recognition | LibriSpeech (test) | WER (clean) | 1.63 | 13 |
| Automatic Speech Recognition | LibriSpeech (dev) | WER (clean) | 1.62 | 13 |
| Automatic Speech Recognition | Wenetspeech | WER (net) | 4.95 | 10 |
| Automatic Speech Recognition | AISHELL-1 | Word Error Rate (WER) | 1.49 | 10 |