
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

About

We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be effectively fine-tuned on new datasets.
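A 1D time-channel separable convolution factors a regular 1D convolution into a depthwise convolution over time (one kernel per channel) followed by a pointwise 1×1 convolution that mixes channels, cutting the parameter count from k·c_in·c_out to k·c_in + c_in·c_out. A minimal NumPy sketch of the idea (function and argument names are illustrative, not from the paper's code):

```python
import numpy as np

def time_channel_separable_conv(x, depthwise_k, pointwise_w):
    """Apply a time-channel separable 1D convolution.

    x:           (channels, time) input features
    depthwise_k: (channels, k) one odd-length time kernel per channel
    pointwise_w: (out_channels, channels) 1x1 mixing weights

    A regular conv would need k * channels * out_channels weights;
    this factorization needs k * channels + channels * out_channels.
    """
    c, t = x.shape
    k = depthwise_k.shape[1]
    pad = k // 2  # "same" padding, assuming odd k
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise step: convolve each channel with its own kernel over time
    # (kernel reversed so np.convolve performs cross-correlation).
    dw = np.stack([
        np.convolve(xp[i], depthwise_k[i][::-1], mode="valid")
        for i in range(c)
    ])
    # Pointwise step: 1x1 convolution mixing channels at every time step.
    return pointwise_w @ dw
```

In the actual model each such layer is followed by batch normalization and ReLU, and several layers are stacked inside a block with a residual connection around the block.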

Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Yang Zhang • 2019

Related benchmarks

| Task | Dataset | WER (%) | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech (test-other) | 7.25 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | 2.69 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | 11.58 | 411 |
| Automatic Speech Recognition | LibriSpeech (dev-clean) | 3.98 | 319 |
| Speech Recognition | WSJ (92-eval) | 4.5 | 131 |
| Speech Recognition | WSJ 93 (test) | 7 | 13 |
| ASR Accent Adaptation | IndicTTS ASM | 27.1 | 8 |
| ASR Accent Adaptation | IndicTTS GUJ | 13.7 | 8 |
| ASR Accent Adaptation | IndicTTS HIN | 11.1 | 8 |
| ASR Accent Adaptation | IndicTTS KAN | 18.7 | 8 |

Showing 10 of 22 rows
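All results above use word error rate (WER): the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the hypothesis, divided by the number of reference words. A small self-contained sketch of the standard computation:

```python
def wer(reference, hypothesis):
    """Word error rate in percent, via word-level Levenshtein distance."""
    r, h = reference.split(), hypothesis.split()
    # (len(r)+1) x (len(h)+1) edit-distance table
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)
```

For example, one substitution in a four-word reference gives `wer("a b c d", "a x c d")` → 25.0.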

Other info

Code
