
Revisiting ASR Error Correction with Specialized Models

About

Language models play a central role in automatic speech recognition (ASR), yet most methods rely on text-only models that are unaware of ASR error patterns. Recently, large language models (LLMs) have been applied to ASR correction, but they introduce latency and hallucination concerns. We revisit ASR error correction with compact seq2seq models trained on ASR errors from real and synthetic audio. To scale training, we construct synthetic corpora via cascaded TTS and ASR, finding that matching the diversity of realistic error distributions is key. We propose correction-first decoding, in which the correction model generates candidate transcripts that are then rescored using the ASR model's acoustic scores. With 15x fewer parameters than LLMs, our model achieves 1.5/3.3% WER on LibriSpeech test-clean/other, outperforms LLMs, generalizes across ASR architectures (CTC, Seq2seq, Transducer) and diverse domains, and provides precise corrections in the low-error regime where LLMs struggle.
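The rescoring step of correction-first decoding can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `rescore`, the candidate tuple layout, and the interpolation weight `alpha` are all assumptions made for the example.

```python
# Hypothetical sketch of correction-first decoding (names are illustrative).
# The correction model proposes N candidate transcripts for an utterance;
# each candidate is rescored by interpolating its correction-model
# log-probability with the ASR model's acoustic log-probability.

def rescore(candidates, alpha=0.3):
    """candidates: list of (text, correction_logprob, acoustic_logprob).

    Returns the candidate text with the highest combined score.
    The weight alpha is an assumed hyperparameter, not a value from the paper.
    """
    best_text, best_score = None, float("-inf")
    for text, corr_lp, ac_lp in candidates:
        score = corr_lp + alpha * ac_lp  # combined correction + acoustic score
        if score > best_score:
            best_text, best_score = text, score
    return best_text

# Toy example: two candidate corrections for one utterance.
cands = [
    ("the cat sat on the mat", -2.1, -5.0),
    ("the cat sad on the mat", -4.0, -4.8),
]
print(rescore(cands))  # picks the candidate with the best combined score
```

The key design point is that the acoustic score keeps the correction model grounded in the audio, so candidates that read fluently but contradict the acoustic evidence are penalized.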

Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly • 2024

Related benchmarks

Task                           Dataset                    Metric   Result   Rank
Automatic Speech Recognition   LibriSpeech clean (test)   WER      1.5      1156
Automatic Speech Recognition   LibriSpeech (test-other)   WER      3.3      1151
Automatic Speech Recognition   LibriSpeech Clean          WER      2.1      80
Automatic Speech Recognition   TED-LIUM3 (test)           WER      3.9      59
Speech Recognition             Switchboard                WER      10.3     20
Speech Recognition             VoxPopuli En               WER      6.4      7
Speech Recognition             CALLHOME                   WER      13.5     6
Automatic Speech Recognition   CHiME-6                    WER      25.8     2
Automatic Speech Recognition   CommonVoice en             WER      10       2
Automatic Speech Recognition   Average 8 datasets         WER      9.5      2
