E-Branchformer: Branchformer with Enhanced merging for speech recognition

About

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention, but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets, respectively, without using any external training data.

Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe • 2022
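
The abstract describes two parallel branches (self-attention for global context, convolution for local context) whose outputs are merged. The toy sketch below illustrates only that data flow in plain Python. Every function name, the fixed averaging "weights", and the stand-ins for attention and convolution are hypothetical placeholders. The temporal mixing step in the second merge function is an assumption about what a more effective merge could look like; the abstract does not specify the method, and this is not the authors' implementation.

```python
def global_branch(frames):
    """Stand-in for self-attention: each frame attends uniformly to all frames."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [mean[:] for _ in frames]


def local_branch(frames, kernel=3):
    """Stand-in for the convolution branch: moving average over nearby frames."""
    half = kernel // 2
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - half): t + half + 1]
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out


def project(frames, weights):
    """Linear projection applied to each frame; weights has one row per output dim."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weights] for f in frames]


def merge_concat(global_out, local_out, weights):
    """Plain merge: concatenate both branches per frame, then project."""
    cat = [g + l for g, l in zip(global_out, local_out)]
    return project(cat, weights)


def merge_with_temporal_mixing(global_out, local_out, weights, kernel=3):
    """Assumed variant: mix the concatenated features along time, add them back
    as a residual, then project. The moving average is a placeholder for a
    learned depth-wise convolution."""
    cat = [g + l for g, l in zip(global_out, local_out)]
    mixed = local_branch(cat, kernel)
    summed = [[c + m for c, m in zip(cf, mf)] for cf, mf in zip(cat, mixed)]
    return project(summed, weights)


def branch_block(frames, weights, enhanced=True):
    """Run both branches in parallel, merge them, and add a residual connection."""
    g = global_branch(frames)
    l = local_branch(frames)
    merge = merge_with_temporal_mixing if enhanced else merge_concat
    merged = merge(g, l, weights)
    return [[x + m for x, m in zip(f, mf)] for f, mf in zip(frames, merged)]


# Three frames of 2-dimensional features; the weights average the two branches.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[0.5, 0.0, 0.5, 0.0],
           [0.0, 0.5, 0.0, 0.5]]
output = branch_block(frames, weights)
```

The point of the sketch is the contrast with a sequential design: both branches read the same input, so the merge step decides how local and global context are combined, and the output keeps the input's shape.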

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 1.8	1207
Automatic Speech Recognition	LibriSpeech (test-other)	WER 3.65	1206
Automatic Speech Recognition	LibriSpeech (test-clean)	WER 2.14	96
Automatic Speech Recognition	SEAME Man (dev)	MER 16.4	33
Automatic Speech Recognition	SEAME SGE (dev)	MER 23.2	33
Automatic Speech Recognition	ASRU 2019 (test)	Match Error Rate (MER) 11.8	32
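
The WER figures in the table are word error rates: the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. A minimal sketch of that calculation for a single utterance follows. The function name is hypothetical, and real benchmark scoring usually also normalizes text and aggregates edit counts over the whole test set, which this sketch does not do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the
    number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the values listed above.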
