Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection
About
As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe the negative effect of the truncated words at the chunk boundaries on the decoding results and propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, which significantly outperforms the current state-of-the-art baseline.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 4.57 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 11.21 | 1151 |
| Automatic Speech Recognition | TED-LIUM 3 | WER 6.69 | 45 |
| Automatic Speech Recognition | MLS FR (test) | WER 15.24 | 13 |
| Automatic Speech Recognition | MLS Spanish | Relative WER 11.58 | 3 |
| Automatic Speech Recognition | MLS German | Relative WER 16.2 | 3 |
| Automatic Speech Recognition | MLS Portuguese | Relative WER 18.44 | 3 |