GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR

About

End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech. A critical bottleneck is that speaker-specific acoustic characteristics, which are essential for distinguishing overlapping speech, are often diluted in deep network layers. To address this, we propose the Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture. GLAD introduces a novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection. Experiments on the LibriSpeechMix and CH109 datasets demonstrate that GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches, exhibiting exceptional robustness in challenging, high-overlap scenarios. To the best of our knowledge, this is the first work to apply a global-local fusion MoE strategy to MTASR.

Yujie Guo, Jiaming Zhou, Yuhang Jia, Shiwan Zhao, Yong Qin • 2025
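
The abstract describes the routing idea only at a high level, so the following is a minimal sketch of what fusing global and local signals for expert selection could look like, not the authors' implementation. Everything named here is an assumption made for illustration: the function `global_local_route`, the sigmoid gate that mixes the two sets of router logits, the weight matrices `w_local`, `w_global`, and `w_gate`, and the top-k selection. The sketch assumes a per-frame local feature and a speaker-aware global feature of the same shape are already available.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_local_route(local_feats, global_feats, w_local, w_global, w_gate, top_k=2):
    """Hypothetical global-local fused router (illustrative only).

    local_feats, global_feats: (T, D) frame-level features
    w_local, w_global:         (D, E) router projections, E = number of experts
    w_gate:                    (2 * D, 1) projection for the fusion gate
    Returns a (T, E) matrix of expert weights with top_k non-zeros per frame.
    """
    local_logits = local_feats @ w_local      # routing scores from local detail
    global_logits = global_feats @ w_global   # routing scores from global context

    # Assumed fusion rule: a per-frame sigmoid gate decides how much to trust
    # the global scores versus the local ones.
    gate_in = np.concatenate([local_feats, global_feats], axis=-1)
    alpha = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate)))          # (T, 1)
    fused = alpha * global_logits + (1.0 - alpha) * local_logits

    # Keep the top_k experts per frame and renormalise their weights.
    top_idx = np.argsort(-fused, axis=-1)[:, :top_k]
    top_logits = np.take_along_axis(fused, top_idx, axis=-1)
    weights = np.zeros_like(fused)
    np.put_along_axis(weights, top_idx, softmax(top_logits, axis=-1), axis=-1)
    return weights


# Toy usage with random features and weights.
rng = np.random.default_rng(0)
T, D, E = 5, 8, 4
weights = global_local_route(
    rng.normal(size=(T, D)),
    rng.normal(size=(T, D)),
    rng.normal(size=(D, E)),
    rng.normal(size=(D, E)),
    rng.normal(size=(2 * D, 1)),
    top_k=2,
)
```

Each row of `weights` sums to 1 and has exactly `top_k` non-zero entries, so it can be used to combine the outputs of the selected experts for that frame.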

Related benchmarks

Task	Dataset	Result
Multi-talker Automatic Speech Recognition	LibriSpeech (dev)	WER 3.7	9
Multi-talker Automatic Speech Recognition	LibriSpeech (test)	WER 4.1	9
Multi-talker Automatic Speech Recognition	LibriSpeechMix 2mix (dev)	WER 7.2	9
Multi-talker Automatic Speech Recognition	LibriSpeechMix 2mix (test)	WER 6.8	9
Multi-talker Automatic Speech Recognition	LibriSpeechMix 3mix (dev)	WER 21.7	9
Multi-talker Automatic Speech Recognition	LibriSpeechMix 3mix (test)	WER 21.1	9
Multi-talker Automatic Speech Recognition	CH109 1-speaker	WER 32.5	4
Multi-talker Automatic Speech Recognition	CH109 2-speaker	WER 48.9	4
Multi-talker Automatic Speech Recognition	CH109	Total WER 0.366	4
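
For readers unfamiliar with the metric in the table, the sketch below computes a plain word error rate (WER) as the word-level edit distance divided by the reference length. It is a generic single-reference illustration; the function name `word_error_rate` and the example sentences are made up here, and the multi-talker scoring protocol behind the numbers above is not specified on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The function returns a fraction (here 2/6); multiply by 100 to express it as a percentage.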
