GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR

About

End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech. A critical bottleneck is that speaker-specific acoustic characteristics, which are essential for distinguishing overlapping speech, are often diluted in deep network layers. To address this, we propose the Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture. GLAD introduces a novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection. Experiments on the LibriSpeechMix and CH109 datasets demonstrate that GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches, exhibiting exceptional robustness in challenging, high-overlap scenarios. To the best of our knowledge, this is the first work to apply a global-local fusion MoE strategy to MTASR.
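As a rough illustration of what routing on fused global and local information could look like, the sketch below computes per-frame expert weights from a blend of an utterance-level summary and the frame itself. It is not the authors' implementation: the mean-pooled global context, the sigmoid fusion gate, the top-k selection, and every name (`global_local_route`, `W_global`, `W_local`, `w_gate`) are assumptions made for this example only.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_local_route(frames, W_global, W_local, w_gate, top_k=2):
    """Return, per frame, a dict {expert_index: weight} for the top_k experts.

    Assumption: the global context is the mean of all frames, a simple
    stand-in for whatever speaker-aware summary the paper actually uses.
    """
    dim = len(frames[0])
    global_ctx = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    global_logits = matvec(W_global, global_ctx)

    routes = []
    for frame in frames:
        local_logits = matvec(W_local, frame)
        # Hypothetical fusion gate: one scalar per frame, computed from the
        # concatenation of the global context and the local frame.
        gate = sigmoid(sum(w * v for w, v in zip(w_gate, global_ctx + frame)))
        fused = [gate * g + (1.0 - gate) * l
                 for g, l in zip(global_logits, local_logits)]
        probs = softmax(fused)
        # Keep the top_k experts and renormalize their weights to sum to 1.
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        norm = sum(probs[i] for i in top)
        routes.append({i: probs[i] / norm for i in top})
    return routes


# Toy usage: 5 frames of dimension 4, routed over 3 experts.
random.seed(0)
demo_frames = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
demo_W_global = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
demo_W_local = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
demo_w_gate = [random.uniform(-1, 1) for _ in range(8)]
demo_routes = global_local_route(demo_frames, demo_W_global, demo_W_local, demo_w_gate)
for t, route in enumerate(demo_routes):
    print(t, route)
```

In this sketch, the gate decides per frame how much the utterance-level summary versus the frame-level features should influence expert selection, which is one plausible reading of "dynamically fuses" in the abstract.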

Yujie Guo, Jiaming Zhou, Yuhang Jia, Shiwan Zhao, Yong Qin • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-talker Automatic Speech Recognition | LibriSpeech (dev) | WER 3.7 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeech (test) | WER 4.1 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeechMix 2mix (dev) | WER 7.2 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeechMix 2mix (test) | WER 6.8 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeechMix 3mix (dev) | WER 21.7 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeechMix 3mix (test) | WER 21.1 | 9 |
| Multi-talker Automatic Speech Recognition | CH109 1-speaker | WER 32.5 | 4 |
| Multi-talker Automatic Speech Recognition | CH109 2-speaker | WER 48.9 | 4 |
| Multi-talker Automatic Speech Recognition | CH109 | Total WER 0.366 | 4 |