
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling

About

As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.

Sanchit Gandhi, Patrick von Platen, Alexander M. Rush • 2023
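The WER heuristic mentioned in the abstract can be illustrated with a small sketch. It assumes the filter compares each pseudo-label against a reference transcript and keeps only pairs whose WER falls at or below a threshold. The function names, the lowercase-and-split normalisation, and the 10% default threshold are illustrative assumptions, not details taken from this page.

```python
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: simple lowercase + whitespace split stands in for a real
    # text normaliser.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen for this sketch: empty reference scores 0 only
        # when the hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    # prev holds the edit distances for the previous reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def filter_pseudo_labels(
    pairs: List[Tuple[str, str]], max_wer: float = 0.10
) -> List[Tuple[str, str]]:
    """Keep (reference, pseudo_label) pairs whose WER is at most max_wer.

    The 0.10 default is a hypothetical value chosen for illustration.
    """
    return [
        (ref, pseudo)
        for ref, pseudo in pairs
        if word_error_rate(ref, pseudo) <= max_wer
    ]
```

For example, a pseudo-label that matches its reference exactly is kept, while one that gets one word wrong out of six (WER of about 0.17) is dropped at the assumed 10% threshold.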

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 2.54 | 1207 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 5.19 | 1206 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER 3.6 | 96 |
| Automatic Speech Recognition | AMI | WER 15.1 | 35 |
| Automatic Speech Recognition | Earnings-22 | WER 11.8 | 29 |
| Automatic Speech Recognition | SPGISpeech | WER 4.1 | 24 |
| Automatic Speech Recognition | TED-LIUM | WER 3.86 | 20 |
| Automatic Speech Recognition | Supreme-court-speech | WER 18.9 | 9 |
