
Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

About

Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach to empower Whisper, a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three seconds of enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on the two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.

Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng • 2024
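
To make step (ii) of the abstract more concrete, the snippet below is a minimal sketch of one generic way to pick a target talker's embedding flow from several separated branches, given an enrollment cue. It is not the paper's Target Talker Identifier: the cosine-similarity scoring, the function name `pick_target_branch`, and the toy three-dimensional vectors are all assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_target_branch(branch_embeddings, enrollment_embedding):
    """Return (index, scores) for the branch most similar to the enrollment cue.

    Hypothetical helper: the scoring rule is an assumption for illustration,
    not the method described in the paper.
    """
    scores = [cosine(b, enrollment_embedding) for b in branch_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data: two separated branches and one enrollment vector (made-up values).
branches = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3]]
enrollment = [0.2, 0.9, 0.2]

index, scores = pick_target_branch(branches, enrollment)
print(index, [round(s, 3) for s in scores])
```

In a real system the vectors would be pooled encoder embeddings rather than hand-written lists, and the selection rule would be learned rather than fixed.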

Related benchmarks

Task                          Dataset                           Metric  Result  Rank
Multi-speaker ASR             LibriSpeech2Mix                   WER     4       14
Multi-speaker ASR             LibriSpeech3Mix                   WER     7.5     9
Automatic Speech Recognition  LibriSpeech2Mix simulated (eval)  WER     6.99    7
Automatic Speech Recognition  LibriSpeech3Mix simulated (eval)  WER     11.4    7
Multi-speaker ASR             Libri2Mix mixclean                WER     6.56    2
Multi-speaker ASR             Libri3Mix mixclean                WER     21.47   2
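
Every result above is a word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard single-reference WER: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It returns a percentage, on the assumption that the listed values are percentages. The function name `word_error_rate` and the example sentences are illustrative, and the multi-talker scoring protocol behind each benchmark (for example, how hypotheses are matched to talkers) is not covered here.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2f}")
```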
