CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions

About

We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is openly available at https://github.com/nyrahealth/CrisperWhisper.
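To make the idea of applying dynamic time warping to cross-attention scores concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The attention matrix is made-up toy data, the cost `1 - attention` is a simplifying assumption, and the names `dtw_path`, `token_frame_spans`, and `FRAME_SECONDS` are hypothetical. The 20 ms frame duration is likewise assumed for illustration.

```python
FRAME_SECONDS = 0.02  # assumed duration of one audio frame (illustrative)


def dtw_path(cost):
    """Return the cheapest monotonic path through cost[token][frame]."""
    n_tokens, n_frames = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = cheapest cumulative cost of a path ending at cell (i-1, j-1)
    acc = [[inf] * (n_frames + 1) for _ in range(n_tokens + 1)]
    acc[0][0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    # Backtrack from the last cell to the first one.
    i, j = n_tokens, n_frames
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # next token and next frame
            (acc[i - 1][j], i - 1, j),          # next token, same frame
            (acc[i][j - 1], i, j - 1),          # same token, next frame
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def token_frame_spans(path, n_tokens):
    """Return (first_frame, last_frame) for each token along the path."""
    first = [None] * n_tokens
    last = [None] * n_tokens
    for token, frame in path:
        if first[token] is None:
            first[token] = frame
        last[token] = frame
    return list(zip(first, last))


# Toy attention weights: 3 tokens (rows) attending over 6 frames (columns).
attention = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.1, 0.9, 1.0],
]
# High attention should be cheap to pass through (assumed cost definition).
cost = [[1.0 - a for a in row] for row in attention]

path = dtw_path(cost)
spans = token_frame_spans(path, len(attention))
for token, (start, end) in enumerate(spans):
    print(token, start * FRAME_SECONDS, (end + 1) * FRAME_SECONDS)
```

Each token is assigned the contiguous run of frames that the path visits in its row, and multiplying frame indices by the frame duration gives start and end times. A real system would first have to extract and normalize attention scores from the model, which this sketch skips.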

Laurin Wagner, Bernhard Thallinger, Mario Zusag • 2024

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 1.82	1207
Automatic Speech Recognition	LibriSpeech (test-other)	WER 4	1206
Automatic Speech Recognition	LibriSpeech Other	WER 3.72	123
Automatic Speech Recognition	LibriSpeech Clean	WER 1.71	107
Automatic Speech Recognition	VoxPopuli	WER 6.03	38
Automatic Speech Recognition	AMI	WER 8.43	35
Automatic Speech Recognition	Earnings-22	WER 12.9	29
Automatic Speech Recognition	Common Voice	WER 7.76	22
Automatic Speech Recognition	TED-LIUM	WER 3.2	20
Automatic Speech Recognition	MLS	WER 5.26	7

Showing 10 of 20 rows
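
The results above are reported as word error rate (WER). As a rough illustration of what that metric measures, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical. Benchmark evaluations typically also normalize text (casing, punctuation) before scoring, which is omitted here. Multiplying the returned fraction by 100 gives a percentage, which is presumably the scale used in the table.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution (b -> x) and one deletion (d) against four reference words.
print(word_error_rate("a b c d", "a x c"))
```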
