QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
About
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be effectively fine-tuned on new datasets.
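The core building block named above, the 1D time-channel separable convolution, factors a full 1D convolution into a depthwise step (one temporal filter per channel) followed by a pointwise 1x1 step that mixes channels. A minimal NumPy sketch of that factorization (an illustration, not the authors' implementation; the function name and array layout are assumptions):

```python
import numpy as np

def time_channel_separable_conv1d(x, depthwise_k, pointwise_w):
    """Sketch of one 1D time-channel separable convolution.

    x:           (channels, time) input feature map
    depthwise_k: (channels, kernel) one temporal filter per channel
                 (kernel length assumed odd, for 'same' padding)
    pointwise_w: (out_channels, channels) 1x1 mixing across channels
    """
    c, t = x.shape
    _, k = depthwise_k.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise step: each channel is convolved with its own kernel
    # over the time axis only.
    dw = np.stack([np.convolve(xp[i], depthwise_k[i], mode="valid")
                   for i in range(c)])
    # Pointwise step: a 1x1 convolution mixes channels at every frame.
    return pointwise_w @ dw
```

This factorization is why the model is small: a full conv needs `k * c_in * c_out` weights, while the separable form needs only `k * c_in + c_in * c_out`.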
Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Yang Zhang • 2019
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER (%) | 7.25 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER (%) | 2.69 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER (%) | 11.58 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | WER (%) | 3.98 | 319 |
| Speech Recognition | WSJ (92-eval) | WER (%) | 4.5 | 131 |
| Speech Recognition | WSJ 93 (test) | WER (%) | 7 | 13 |
| ASR Accent Adaptation | IndicTTS ASM | WER (%) | 27.1 | 8 |
| ASR Accent Adaptation | IndicTTS GUJ | WER (%) | 13.7 | 8 |
| ASR Accent Adaptation | IndicTTS HIN | WER (%) | 11.1 | 8 |
| ASR Accent Adaptation | IndicTTS KAN | WER (%) | 18.7 | 8 |