Zipformer: A faster and better encoder for automatic speech recognition

About

The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm, called BiasNorm, that allows us to retain some length information; 4) new activation functions, SwooshR and SwooshL, that work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
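The abstract names BiasNorm, SwooshR, and SwooshL without giving their definitions. The sketch below illustrates them in plain Python, assuming the forms given in the full paper as best recalled here: SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687, SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035, and BiasNorm(x) = x / RMS(x - b) * exp(gamma). The function names, the `eps` term, and the list-based layout are illustrative assumptions, not the icefall implementation; the constants should be checked against the released code before use.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swoosh_r(x):
    # Assumed form: log(1 + exp(x - 1)) - 0.08 * x - 0.313261687.
    # The offset makes the curve pass (approximately) through the origin.
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x):
    # Assumed form: log(1 + exp(x - 4)) - 0.08 * x - 0.035.
    return softplus(x - 4.0) - 0.08 * x - 0.035


def bias_norm(x, bias, log_scale, eps=1e-8):
    # Assumed form: x / RMS(x - bias) * exp(log_scale), with the RMS taken
    # over the channel dimension. `bias` is a per-channel vector and
    # `log_scale` a scalar; `eps` is added here only to avoid division by zero.
    mean_sq = sum((xi - bi) ** 2 for xi, bi in zip(x, bias)) / len(x)
    scale = math.exp(log_scale) / math.sqrt(mean_sq + eps)
    return [xi * scale for xi in x]


if __name__ == "__main__":
    print(swoosh_r(0.0), swoosh_l(0.0))
    print(bias_norm([3.0, 4.0], [0.0, 0.0], 0.0))
```

Unlike LayerNorm, this form does not subtract the mean from the input, and the bias enters only the denominator, which is one way to read the abstract's remark that some length information is retained.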

Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey • 2023

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 1.6	1207
Automatic Speech Recognition	LibriSpeech (test-other)	WER 3.6	1206
Automatic Speech Recognition	AISHELL-1 (test)	CER 4.28	105
Automatic Speech Recognition	LibriSpeech (test-clean)	WER 2	96
Automatic Speech Recognition	WenetSpeech Meeting (test)	CER 12.06	78
Automatic Speech Recognition	WenetSpeech Net (test)	CER 7.24	57
Automatic Speech Recognition	AISHELL-1 (dev)	CER 4.03	57
Automatic Speech Recognition	GigaSpeech (test)	WER 10.2	48
Speech Recognition	AISHELL-1 (dev)	WER 4	28
Phone Feature Recognition	Buckeye (sociophonetic)	PFER 6.8	25

Showing 10 of 20 rows
