ConVoiFilter: A case study of doing cocktail party speech recognition
About
This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module (ConVoiFilter), which isolates the target speaker's voice from background noise, with an ASR module. With this approach, the ASR word error rate (WER) drops from 80% to 26.4%. Typically, these two components are tuned independently because their data requirements differ. However, speech enhancement can create anomalies that degrade ASR performance. A joint fine-tuning strategy reduces the WER further, from 26.4% with separate tuning to 14.5% with joint tuning. We openly share our pre-trained model to foster further research at hf.co/nguyenvulebinh/voice-filter.
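To make the reported numbers concrete, the following is a minimal sketch of how WER is conventionally computed (word-level edit distance divided by the number of reference words) and of the relative reductions implied by the figures above. It is not the authors' evaluation code; the function names `word_error_rate` and `relative_reduction` and the example sentences are hypothetical, and the sketch assumes simple whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization only
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Hypothetical example: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Relative reductions implied by the WER figures quoted in the abstract.
enhancement_gain = relative_reduction(80.0, 26.4)  # about 0.67
joint_tuning_gain = relative_reduction(26.4, 14.5)  # about 0.45
```

Under these assumptions, adding the enhancement module removes roughly two thirds of the word errors, and joint fine-tuning removes a little under half of those that remain.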
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 10.2 | 1156 |
| Automatic Speech Recognition | LibriSpeech other Speech Noise - Reverb (test) | WER 37.1 | 28 |
| Automatic Speech Recognition | LibriSpeech clean Speech Noise - Reverb (test) | WER 29.8 | 28 |
| Automatic Speech Recognition | LibriSpeech other Speech Noise - Additive (test) | WER 20.6 | 28 |
| Automatic Speech Recognition | LibriSpeech Clean other (test) | WER 11.3 | 28 |
| Automatic Speech Recognition | LibriSpeech clean Speech Noise - Additive (test) | WER 14.9 | 28 |
| Target Speaker Extraction | Libri2Mix Clean (test) | -- | 9 |
| Target Speaker Extraction | Libri2Mix Single Speaker (test) | WER 10.5 | 5 |