ConVoiFilter: A case study of doing cocktail party speech recognition

About

This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module (ConVoiFilter), which isolates the target speaker's voice from background noise, with an ASR module. With this approach, the model reduces the ASR word error rate (WER) from 80% to 26.4%. Typically, these two components are tuned independently because their data requirements differ. However, speech enhancement can introduce artifacts that degrade ASR performance. With a joint fine-tuning strategy, the model reduces the WER further, from 26.4% with separate tuning to 14.5% with joint tuning. We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
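To make the WER figures above concrete, the following is a minimal sketch of how a word error rate and the relative improvement between two WER values can be computed. It uses the standard definition of WER (word-level edit distance divided by the number of reference words). It is not the authors' evaluation code, and the function names `word_error_rate` and `relative_reduction` are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy transcripts (made up for illustration, not taken from the paper).
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

    # WER values quoted in the abstract, in percent.
    print(round(relative_reduction(80.0, 26.4), 3))  # enhancement + ASR vs. baseline
    print(round(relative_reduction(26.4, 14.5), 3))  # joint vs. separate tuning
```

Read this way, the reported drop from 80% to 26.4% is a relative reduction of about 67%, and the drop from 26.4% to 14.5% with joint fine-tuning is a further relative reduction of about 45%.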

Thai-Binh Nguyen, Alexander Waibel • 2023

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 10.2	1207
Automatic Speech Recognition	LibriSpeech Clean other (test)	WER 11.3	34
Automatic Speech Recognition	LibriSpeech other Speech Noise - Reverb (test)	WER 37.1	28
Automatic Speech Recognition	LibriSpeech clean Speech Noise - Reverb (test)	WER 29.8	28
Automatic Speech Recognition	LibriSpeech other Speech Noise - Additive (test)	WER 20.6	28
Automatic Speech Recognition	LibriSpeech clean Speech Noise - Additive (test)	WER 14.9	28
Target Speaker Extraction	Libri2Mix Clean (test)	--	9
Target Speaker Extraction	Libri2Mix Single Speaker (test)	WER 10.5	5
