GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR
About
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech. A critical bottleneck is that speaker-specific acoustic characteristics, which are essential for distinguishing overlapping speech, are often diluted in deep network layers. To address this, we propose the Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture. GLAD introduces a novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection. Experiments on the LibriSpeechMix and CH109 datasets demonstrate that GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches, exhibiting exceptional robustness in challenging, high-overlap scenarios. To the best of our knowledge, this is the first work to apply a global-local fusion MoE strategy to MTASR.
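The abstract does not spell out how the global and local signals are combined, so the following is only a minimal sketch of what a global-local router could look like, not the GLAD implementation. Everything in it is an assumption made for illustration: the global context is taken to be the mean over all frames, the fusion is a fixed convex combination controlled by `alpha`, expert selection is top-k with a softmax over the selected logits, and the names `global_local_route`, `w_global`, and `w_local` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, vec):
    """Multiply a matrix (one row per expert) by a feature vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def global_local_route(frames, w_global, w_local, alpha=0.5, top_k=2):
    """Return, for each frame, a list of (expert_index, weight) pairs.

    ASSUMPTIONS (not taken from the paper): mean-pooled global context,
    a fixed scalar fusion weight `alpha`, and top-k softmax routing.
    """
    dim = len(frames[0])
    # Global view: one utterance-level vector shared by every frame.
    global_ctx = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    global_logits = matvec(w_global, global_ctx)

    routes = []
    for frame in frames:
        # Local view: logits computed from this frame alone.
        local_logits = matvec(w_local, frame)
        fused = [alpha * g + (1 - alpha) * l
                 for g, l in zip(global_logits, local_logits)]
        # Keep the top_k experts and renormalize their weights.
        top = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:top_k]
        weights = softmax([fused[i] for i in top])
        routes.append(list(zip(top, weights)))
    return routes


# Toy example: two 2-dimensional frames routed over three experts.
frames = [[1.0, 0.0], [0.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
routes = global_local_route(frames, w_global=w, w_local=w)
```

In this toy setup the global logits are identical for every frame, so the per-frame differences in routing come entirely from the local term. A learned, input-dependent fusion would replace the fixed `alpha` in a real model.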
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-talker Automatic Speech Recognition | LibriSpeech (dev) | WER 3.7 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeech (test) | WER 4.1 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeechMix 2mix (dev) | WER 7.2 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeechMix 2mix (test) | WER 6.8 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeechMix 3mix (dev) | WER 21.7 | 9 |
| Multi-talker Automatic Speech Recognition | LibriSpeechMix 3mix (test) | WER 21.1 | 9 |
| Multi-talker Automatic Speech Recognition | CH109 1-speaker | WER 32.5 | 4 |
| Multi-talker Automatic Speech Recognition | CH109 2-speaker | WER 48.9 | 4 |
| Multi-talker Automatic Speech Recognition | CH109 | Total WER 0.366 | 4 |
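For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below shows that standard definition only. It is not the scoring script behind the numbers above, and it omits the speaker-order handling that multi-talker evaluation usually requires; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length.

    Returns a fraction; multiply by 100 for a percentage.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer_sub = word_error_rate("the cat sat", "the hat sat")  # one substitution
wer_del = word_error_rate("a b c d", "a b")              # two deletions
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.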