
ConVoiFilter: A case study of doing cocktail party speech recognition

About

This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module (ConVoiFilter), which isolates the speaker's voice from background noise, with an ASR module. With this approach, the model reduces the ASR word error rate (WER) from 80% to 26.4%. Typically, these two components are tuned independently because their data requirements differ. However, speech enhancement can introduce artifacts that degrade ASR performance. With a joint fine-tuning strategy, the model reduces the WER further, from 26.4% with separate tuning to 14.5% with joint tuning. We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.

Thai-Binh Nguyen, Alexander Waibel • 2023
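
The WER figures quoted in the abstract can be made concrete with a small sketch. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words). The paper's own scoring setup, including any text normalization, is not described here, and the function names `word_error_rate` and `relative_reduction` are illustrative, not taken from the authors' code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed, e.g. 0.5 means halved."""
    return (before - after) / before


if __name__ == "__main__":
    # One substitution and one deletion against six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Relative WER reductions implied by the abstract's numbers.
    print(relative_reduction(80.0, 26.4))
    print(relative_reduction(26.4, 14.5))
```

Under this definition, going from 80% to 26.4% removes about 67% of the errors, and joint fine-tuning (26.4% to 14.5%) removes roughly a further 45% of those that remain.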

Related benchmarks

Task                          Dataset                                           Result    Rank
Automatic Speech Recognition  LibriSpeech clean (test)                          WER 10.2  1156
Automatic Speech Recognition  LibriSpeech other Speech Noise - Reverb (test)    WER 37.1  28
Automatic Speech Recognition  LibriSpeech clean Speech Noise - Reverb (test)    WER 29.8  28
Automatic Speech Recognition  LibriSpeech other Speech Noise - Additive (test)  WER 20.6  28
Automatic Speech Recognition  LibriSpeech Clean other (test)                    WER 11.3  28
Automatic Speech Recognition  LibriSpeech clean Speech Noise - Additive (test)  WER 14.9  28
Target Speaker Extraction     Libri2Mix Clean (test)                            --        9
Target Speaker Extraction     Libri2Mix Single Speaker (test)                   WER 10.5  5