Bridging the Granularity Gap for Acoustic Modeling
About
While the Transformer has become the de facto standard for speech processing, modeling on fine-grained frame-level features remains an open challenge: long-distance dependencies are hard to capture, and attention weights are spread thinly across many frames. We propose *Progressive Down-Sampling* (PDS), which gradually compresses the acoustic features into coarser-grained units that carry more complete semantic information, akin to text-level representations. In addition, we develop a representation fusion method to alleviate the information loss that inevitably occurs under high compression. In this way, we compress the acoustic features to 1/32 of the initial length while achieving better or comparable performance on the speech recognition task. As a bonus, this yields inference speedups ranging from 1.20× to 1.47×. By reducing the modeling burden, we also achieve competitive results on the more challenging speech translation task.
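To make the idea concrete, here is a minimal numpy sketch of progressive down-sampling and fusion. It is an illustration under stated assumptions, not the paper's implementation: each stage merges adjacent frames by simple averaging (standing in for a learned strided convolution), and the `fuse` helper is a toy stand-in for the representation fusion method.

```python
import numpy as np

def downsample_stage(x, stride=2):
    """One PDS-style stage: merge every `stride` adjacent frames by averaging
    (a stand-in for a learned strided convolution)."""
    T, D = x.shape
    T_trim = (T // stride) * stride  # drop trailing frames that do not fill a group
    return x[:T_trim].reshape(T_trim // stride, stride, D).mean(axis=1)

def progressive_downsample(x, num_stages=5, stride=2):
    """Apply stages until the length is 1/(stride**num_stages) of the input
    (2**5 = 32, matching the 1/32 compression above), keeping every
    intermediate representation for later fusion."""
    reps = [x]
    for _ in range(num_stages):
        x = downsample_stage(x, stride)
        reps.append(x)
    return x, reps

def fuse(reps):
    """Toy representation fusion (an assumption, not the paper's exact method):
    pool each intermediate representation to the coarsest length and average,
    so information discarded by later stages can still contribute."""
    target_len = reps[-1].shape[0]
    pooled = []
    for r in reps:
        s = r.shape[0] // target_len
        pooled.append(downsample_stage(r, s) if s > 1 else r[:target_len])
    return np.mean(pooled, axis=0)

frames = np.random.randn(320, 80)   # 320 frames of 80-dim filterbank features
coarse, reps = progressive_downsample(frames)
fused = fuse(reps)                  # coarse and fused both have length 320/32 = 10
```

In the actual model the pooling would be replaced by learned layers interleaved with Transformer blocks, but the length bookkeeping shown here is what produces the inference speedup: attention cost shrinks quadratically with sequence length.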
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech 960h (test-other) | WER 6.38 | 81 |
| Automatic Speech Recognition | AISHELL-1 (test) | -- | 71 |
| Automatic Speech Recognition | LibriSpeech 960h (test-clean) | WER 2.72 | 53 |
| Automatic Speech Recognition | LibriSpeech 960h (dev-other) | WER 6.31 | 50 |
| Automatic Speech Recognition | AISHELL-1 (dev) | WER 4.72 | 28 |
| Automatic Speech Recognition | LibriSpeech 960h clean (dev) | WER 2.7 | 25 |
| Automatic Speech Recognition | LibriSpeech 960h | WER 4.52 | 20 |
| Speech Translation | MuST-C En-De (test) | SacreBLEU 28.7 | 18 |