
Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection

About

As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe the negative effect of the truncated words at the chunk boundaries on the decoding results and propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, which significantly outperforms the current state-of-the-art baseline.
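The two ideas in the abstract can be sketched in a few lines: (1) use the per-token cross-attention alignment to decide which decoded tokens are "stable" (their attention peak lands safely before the chunk boundary) and hold back the rest, and (2) use an integrate-and-fire accumulator to flag a word cut off at the boundary. The sketch below is illustrative only, with synthetic attention weights; the function names, the `margin` and `threshold` values, and the leftover-charge heuristic are assumptions, not the paper's actual implementation.

```python
import numpy as np

def first_unstable_token(cross_attn, chunk_end, margin=10):
    """Return the index of the first token whose cross-attention peak falls
    within `margin` frames of the chunk boundary; that token and all later
    ones are held back, since the audio supporting them may be incomplete.

    cross_attn: (num_tokens, num_frames) array of attention weights.
    """
    peaks = cross_attn.argmax(axis=1)  # most-attended frame per token
    for i, peak in enumerate(peaks):
        if peak >= chunk_end - margin:
            return i
    return cross_attn.shape[0]  # every token is safely inside the chunk

def truncated_at_boundary(frame_weights, threshold=1.0):
    """Integrate-and-fire sketch: accumulate per-frame weights and 'fire'
    (emit a word boundary) whenever the accumulator crosses `threshold`.
    Substantial leftover charge at the chunk end suggests a word was cut
    off mid-way (heuristic chosen here for illustration)."""
    acc = 0.0
    for w in frame_weights:
        acc += w
        if acc >= threshold:
            acc -= threshold  # fire: a complete word ended here
    return acc > 0.5 * threshold  # large residue -> likely truncated word
```

In a streaming loop, tokens past `first_unstable_token` would be discarded and re-decoded once the next chunk arrives, which is what lets the pre-trained model run chunk-by-chunk without fine-tuning.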

Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li • 2024

Related benchmarks

Task                          Dataset                   Metric        Result  Rank
Automatic Speech Recognition  LibriSpeech clean (test)  WER           4.57    1156
Automatic Speech Recognition  LibriSpeech (test-other)  WER           11.21   1151
Automatic Speech Recognition  TED-LIUM 3                WER           6.69    45
Automatic Speech Recognition  MLS FR (test)             WER           15.24   13
Automatic Speech Recognition  MLS Spanish               Relative WER  11.58   3
Automatic Speech Recognition  MLS German                Relative WER  16.2    3
Automatic Speech Recognition  MLS Portuguese            Relative WER  18.44   3
