Revisiting ASR Error Correction with Specialized Models
About
Language models play a central role in automatic speech recognition (ASR), yet most methods rely on text-only models that are unaware of ASR error patterns. Recently, large language models (LLMs) have been applied to ASR correction, but they introduce latency and hallucination concerns. We revisit ASR error correction with compact seq2seq models trained on ASR errors from real and synthetic audio. To scale training, we construct synthetic corpora via cascaded TTS and ASR, finding that matching the diversity of realistic error distributions is key. We propose correction-first decoding, in which the correction model generates candidates that are then rescored using the ASR model's acoustic scores. With 15x fewer parameters than LLMs, our model achieves 1.5/3.3% WER on LibriSpeech test-clean/other, outperforms LLMs, generalizes across ASR architectures (CTC, Seq2seq, Transducer) and diverse domains, and provides precise corrections in the low-error regime where LLMs struggle.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.5 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 3.3 | 1151 |
| Automatic Speech Recognition | LibriSpeech Clean | WER | 2.1 | 80 |
| Automatic Speech Recognition | TED-LIUM3 (test) | WER | 3.9 | 59 |
| Speech Recognition | Switchboard | WER | 10.3 | 20 |
| Speech Recognition | VoxPopuli En | WER | 6.4 | 7 |
| Speech Recognition | CALLHOME | WER | 13.5 | 6 |
| Automatic Speech Recognition | CHiME-6 | WER | 25.8 | 2 |
| Automatic Speech Recognition | CommonVoice en | WER | 10 | 2 |
| Automatic Speech Recognition | Average 8 datasets | WER | 9.5 | 2 |