CAT: A CTC-CRF based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches towards Data Efficiency and Low Latency
About
In this paper, we present a new open-source toolkit for speech recognition, named CAT (CTC-CRF based ASR Toolkit). CAT inherits the data efficiency of the hybrid approach and the simplicity of the E2E approach, providing a full-fledged implementation of CTC-CRFs and complete training and testing scripts for a number of English and Chinese benchmarks. Experiments show that CAT obtains state-of-the-art results, comparable to those of the fine-tuned hybrid models in Kaldi but with a much simpler training pipeline. Compared to existing non-modularized E2E models, CAT performs better on limited-scale datasets, demonstrating its data efficiency. Furthermore, we propose a new method called contextualized soft forgetting, which enables CAT to do streaming ASR without accuracy degradation. We hope that CAT, especially the CTC-CRF based framework and software, will be of broad interest to the community and can be further explored and improved.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speech Recognition | WSJ (92-eval) | WER | 3.2 | 131 |
| Speech Recognition | WSJ nov93 (dev) | WER | 5.7 | 52 |
| Automatic Speech Recognition | Hub5 2000 (SWB) | WER | 7.3 | 21 |
| Automatic Speech Recognition | AISHELL (test) | CER | 6.34 | 20 |
| Automatic Speech Recognition | 80-hour WSJ (dev93) | WER | 5.7 | 16 |
| Automatic Speech Recognition | Eval2000-CH Fisher-Switchboard 2300-h (test) | WER (SW Subset) | 9.8 | 10 |
| Automatic Speech Recognition | Eval2000 Fisher-Switchboard 2300-h (test) | WER | 11.2 | 9 |
| Speech Recognition | Switchboard Eval2000 | SW Error Rate | 9.7 | 9 |
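
The results above are word error rates (WER) for the English tasks and a character error rate (CER) for AISHELL. As a reminder of what these numbers measure, here is a minimal Python sketch of the standard edit-distance definition. It is not taken from CAT or from any scoring tool used for these benchmarks; the function names `edit_distance`, `error_rate`, `wer`, and `cer` are hypothetical, and the sketch scores a single utterance, whereas benchmark figures are normally computed by summing errors over the whole test set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference token missing in hyp
            insertion = curr[j - 1] + 1  # extra token in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent: edit distance divided by reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate (percent) for whitespace-separated text."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate (percent), ignoring spaces (assumed convention)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, insertions can push the rate above 100 percent; ignoring spaces in `cer` is an assumption made here for simplicity, not a documented convention of the AISHELL scoring.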