CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions
About
We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is openly available at https://github.com/nyrahealth/CrisperWhisper.
Laurin Wagner, Bernhard Thallinger, Mario Zusag • 2024
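The abstract refers to applying dynamic time warping (DTW) to the decoder's cross-attention scores to obtain word-level timestamps. The sketch below illustrates that general idea on a toy attention matrix; it is not the authors' implementation. The function names `dtw_path` and `token_spans`, the cost definition `1 - attention`, and the toy values are assumptions made for illustration. The 0.02 s default reflects Whisper's encoder producing 1500 positions per 30 s window.

```python
import numpy as np


def dtw_path(cost):
    """Minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost with a padded border
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the bottom-right corner to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
        k = int(np.argmin(steps))
        if k == 0:
            i, j = i - 1, j - 1
        elif k == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def token_spans(path, n_tokens, seconds_per_frame=0.02):
    """Convert a DTW path into (start, end) times in seconds for each token."""
    spans = []
    for t in range(n_tokens):
        frames = [f for tok, f in path if tok == t]
        spans.append((min(frames) * seconds_per_frame,
                      (max(frames) + 1) * seconds_per_frame))
    return spans


# Toy cross-attention weights: 3 tokens (rows) over 6 audio frames (columns).
attn = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.9, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.1, 0.9, 0.9],
])
path = dtw_path(1.0 - attn)  # assumption: high attention means low alignment cost
spans = token_spans(path, n_tokens=attn.shape[0])
for token_index, (start, end) in enumerate(spans):
    print(f"token {token_index}: {start:.2f}s to {end:.2f}s")
```

In a real system the attention matrix would come from selected decoder heads and would typically be normalized and smoothed before alignment; those steps are omitted here.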
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 4 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.82 | 833 |
| Automatic Speech Recognition | AMI | WER | 9.89 | 28 |
| Automatic Speech Recognition | Earnings-22 | WER | 12.9 | 25 |
| Automatic Speech Recognition | TED-LIUM | WER | 3.2 | 9 |
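The benchmark results above are reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name `word_error_rate` and the example sentences are illustrative, and the text normalization used for the listed results is not stated here, so this sketch only splits on whitespace.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A verbatim reference keeps the filler "um"; a non-verbatim hypothesis
# that drops it is charged one deletion out of four reference words.
verbatim_wer = word_error_rate("i um think so", "i think so")
print(f"WER: {verbatim_wer:.2f}")
```

The example shows why verbatim benchmarks penalize models that silently drop fillers: each omitted filler counts as a deletion against the reference.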