Revisiting ASR Error Correction with Specialized Models
About
Language models play a central role in automatic speech recognition (ASR), yet most methods rely on text-only models that are unaware of ASR error patterns. Recently, large language models (LLMs) have been applied to ASR correction, but they introduce latency and hallucination concerns. We revisit ASR error correction with compact seq2seq models trained on ASR errors from real and synthetic audio. To scale training, we construct synthetic corpora via cascaded TTS and ASR, finding that matching the diversity of realistic error distributions is key. We propose correction-first decoding, in which the correction model generates candidates that are then rescored using the ASR model's acoustic scores. With 15x fewer parameters than LLMs, our model achieves 1.5/3.3% WER on LibriSpeech test-clean/other, outperforms LLMs, generalizes across ASR architectures (CTC, Seq2seq, Transducer) and diverse domains, and provides precise corrections in the low-error regime where LLMs struggle.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 1.5 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 3.3 | 1151 |
| Automatic Speech Recognition | LibriSpeech Clean | WER | 2.1 | 80 |
| Automatic Speech Recognition | TED-LIUM3 (test) | WER | 3.9 | 59 |
| Speech Recognition | Switchboard | WER | 10.3 | 20 |
| Speech Recognition | VoxPopuli En | WER | 6.4 | 7 |
| Speech Recognition | CALLHOME | WER | 13.5 | 6 |
| Automatic Speech Recognition | CHiME-6 | WER | 25.8 | 2 |
| Automatic Speech Recognition | CommonVoice en | WER | 10 | 2 |
| Automatic Speech Recognition | Average 8 datasets | WER | 9.5 | 2 |