Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

About

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
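To make the quoted comparisons concrete, the following minimal sketch shows how word error rate (WER) and a relative reduction are typically computed. It assumes the percentages in the abstract are error rates (lower is better) and that the first figure in each pair belongs to Mega-ASR; the function names `wer` and `relative_reduction` are illustrative and do not come from the paper's code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction by which `system` lowers the error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline


if __name__ == "__main__":
    # Figures quoted in the abstract, assumed here to be error rates in percent.
    for name, baseline, system in [
        ("VOiCES R4-B-F", 54.01, 45.69),
        ("NOIZEUS Sta-0", 29.34, 21.49),
    ]:
        print(f"{name}: {relative_reduction(baseline, system):.1%} relative reduction")
```

Under those assumptions, the two quoted pairs correspond to relative reductions of roughly 15.4% and 26.8%. The "over 30%" figure in the abstract refers to the separate compositional-scenario evaluation.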

Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | VoxPopuli | -- | 38 |
| Automatic Speech Recognition | CommonVoice | CER (English) 7.57 | 19 |
| Automatic Speech Recognition | CHiME-4 | WER (Real Condition) 4.38 | 13 |
| Automatic Speech Recognition | NOIZEUS | Error Rate (0dB SNR) 19.8 | 13 |
| Automatic Speech Recognition | Fleurs | WER (en) 3.17 | 13 |
| Automatic Speech Recognition | VOiCES | RM12.36 | 13 |
| Automatic Speech Recognition | LibriSpeech (test) | WER (clean) 1.63 | 13 |
| Automatic Speech Recognition | LibriSpeech (dev) | WER (clean) 1.62 | 13 |
| Automatic Speech Recognition | Wenetspeech | WER (net) 4.95 | 10 |
| Automatic Speech Recognition | AISHELL-1 | Word Error Rate (WER) 1.49 | 10 |
