
Bridging the Granularity Gap for Acoustic Modeling

About

Although the Transformer has become the de facto standard for speech processing, modeling fine-grained frame-level features remains an open challenge: long-distance dependencies are hard to capture, and attention weights are hard to distribute across so many frames. We propose Progressive Down-Sampling (PDS), which gradually compresses the acoustic features into coarser-grained units that carry more complete semantic information, similar to text-level representations. In addition, we develop a representation fusion method to alleviate the information loss that inevitably occurs under high compression. In this way, we compress the acoustic features to 1/32 of their initial length while achieving better or comparable performance on the speech recognition task. As a bonus, this yields inference speedups ranging from 1.20× to 1.47×. By reducing the modeling burden, we also achieve competitive results when training on the more challenging speech translation task.

Chen Xu, Yuhao Zhang, Chengbo Jiao, Xiaoqian Liu, Chi Hu, Xin Zeng, Tong Xiao, Anxiang Ma, Huizhen Wang, JingBo Zhu • 2023
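
To make the compression arithmetic in the abstract concrete, here is a minimal sketch of progressive down-sampling followed by a simple fusion step. It is not the authors' implementation. The function names (`pool`, `progressive_downsample`, `fuse`) are hypothetical, mean pooling stands in for whatever learned down-sampling modules the paper uses, the five stride-2 stages are only one assumed way to reach the stated 1/32 ratio, and the max-pool alignment with uniform fusion weights is an assumption made purely for illustration.

```python
from typing import Callable, List, Optional, Sequence

Frame = List[float]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def pool(frames: List[Frame], stride: int,
         reduce: Callable[[Sequence[float]], float]) -> List[Frame]:
    """Reduce each group of `stride` consecutive frames to one frame, per dimension.

    A trailing partial group is reduced over the frames it contains.
    """
    pooled = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]
        pooled.append([reduce(dim_values) for dim_values in zip(*group)])
    return pooled


def progressive_downsample(frames: List[Frame],
                           strides: Sequence[int] = (2, 2, 2, 2, 2)) -> List[List[Frame]]:
    """Compress the sequence one stage at a time and keep every stage's output.

    Assumption: mean pooling replaces a learned down-sampling layer.
    """
    stage_outputs = []
    current = frames
    for stride in strides:
        current = pool(current, stride, mean)
        stage_outputs.append(current)
    return stage_outputs


def fuse(stage_outputs: List[List[Frame]], strides: Sequence[int],
         weights: Optional[Sequence[float]] = None) -> List[Frame]:
    """Align every stage output to the final length, then take a weighted sum.

    Assumptions: alignment uses max pooling with the remaining strides, and the
    weights default to uniform; a trained model would learn both.
    """
    if weights is None:
        weights = [1.0 / len(stage_outputs)] * len(stage_outputs)
    fused = None
    for i, (output, weight) in enumerate(zip(stage_outputs, weights)):
        aligned = output
        for stride in strides[i + 1:]:
            aligned = pool(aligned, stride, max)
        scaled = [[weight * x for x in frame] for frame in aligned]
        if fused is None:
            fused = scaled
        else:
            fused = [[a + b for a, b in zip(fa, fb)] for fa, fb in zip(fused, scaled)]
    return fused


STRIDES = (2, 2, 2, 2, 2)
frames = [[float(t)] for t in range(64)]  # toy 1-dimensional "acoustic features"
stages = progressive_downsample(frames, STRIDES)
fused = fuse(stages, STRIDES)
print([len(s) for s in stages])  # [32, 16, 8, 4, 2]
print(len(fused) / len(frames))  # 0.03125, i.e. 1/32 of the initial length
```

Because self-attention cost grows with the square of the sequence length, the later stages in this sketch would attend over far fewer positions than the frame-level input. The fusion step reuses the intermediate outputs so that detail discarded by the final stage can still contribute to the result.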

Related benchmarks

Task                          Dataset                        Result          Rank
Automatic Speech Recognition  LibriSpeech 960h (test-other)  WER 6.38        81
Automatic Speech Recognition  AISHELL-1 (test)               --              71
Automatic Speech Recognition  LibriSpeech 960h (test-clean)  WER 0.0272      53
Automatic Speech Recognition  LibriSpeech 960h (dev-other)   WER 6.31        50
Speech Recognition            AISHELL-1 (dev)                WER 4.72        28
Automatic Speech Recognition  LibriSpeech 960h clean (dev)   WER 2.7         25
Automatic Speech Recognition  LibriSpeech 960h               WER 4.52        20
Speech Translation            MuST-C En-De (test)            SacreBLEU 28.7  18
