
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

About

Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination and repetition, and prohibits batched transcription due to their sequential nature. Further, timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out of the box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.
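
The abstract names a VAD Cut & Merge pre-segmentation step but does not spell out its rules. The sketch below is a simplified illustration of the general idea only, not the authors' implementation: it assumes VAD output is a list of (start, end) times in seconds, assumes a 30-second maximum chunk length (Whisper's input window), and cuts overlong segments at fixed boundaries, which is a placeholder choice. The function names `cut_segments` and `merge_segments` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds); assumed VAD output format


def cut_segments(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Split any speech segment longer than max_len into shorter pieces.

    Assumption: cuts fall at fixed max_len boundaries. This is a placeholder
    rule for illustration, not the cut rule used by WhisperX.
    """
    out: List[Segment] = []
    for start, end in segments:
        while end - start > max_len:
            out.append((start, start + max_len))
            start += max_len
        out.append((start, end))
    return out


def merge_segments(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Greedily merge neighbouring segments while the merged span fits in max_len."""
    merged: List[Segment] = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_len:
            # Extend the current chunk to cover this segment (and the gap before it).
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Hypothetical VAD output: three short segments and one long one.
vad = [(0.0, 4.0), (5.0, 12.0), (13.0, 50.0), (52.0, 55.0)]
chunks = merge_segments(cut_segments(vad))
print(chunks)
```

Because every resulting chunk fits within the model's input window and is independent of the others, the chunks can be padded to a common length and transcribed together in a batch, rather than one after another as in a sliding-window approach.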

Max Bain, Jaesung Huh, Tengda Han, Andrew Zisserman • 2023

Related benchmarks

Task                                Dataset                       Result                            Rank
Automatic Speech Recognition        AMI IHM                       WER 22.5                          12
Forced Alignment                    MFA-Labeled Raw (test)        AAS Latency (Avg) 133.2           8
Speech Recognition and Diarization  AMI IHM                       WER 22.51                         6
Forced Alignment                    GTSinger-Speech-ZH            AAS 221.3                         5
Forced Alignment                    LibriSpeech Other             AAS 96.64                         5
Forced Alignment                    LibriSpeech Clean             AAS 87.02                         5
Speaker Diarization                 Competition Audio (train)     DER 0.248                         5
Forced Alignment                    Human-Labeled (test)          Avg. RTF 0.0113                   4
Forced Alignment                    MFA-labeled Long-form (test)  Average Alignment Value 2.71e+3   4
Speaker Diarization                 StoryGen Eval                 tcpWER 55.9                       3
Showing 10 of 11 rows
