Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
About
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
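The pseudo-language pipeline described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the cluster-ID sequences, the tiny corpus, and the single BPE-style merge are stand-ins (in practice the discrete units come from clustering self-supervised speech features, and a full subword vocabulary is learned over a large corpus).

```python
from collections import Counter

# Hypothetical cluster-ID sequence for one utterance, e.g. produced by
# k-means over self-supervised speech features. Values are illustrative.
units = [3, 3, 3, 7, 7, 1, 1, 1, 1, 3, 7, 7]

# Step 1: collapse consecutive repeats to shorten the unit sequence.
dedup = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

# Step 2: learn subword merges over a corpus of such sequences, BPE-style.
# As a sketch, we perform a single merge of the most frequent adjacent pair;
# a real vocabulary would repeat this many times.
corpus = [dedup, [3, 7, 1], [1, 3, 7]]

def most_frequent_pair(seqs):
    """Count adjacent unit pairs across the corpus; return the most common."""
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with a new subword ID."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

pair = most_frequent_pair(corpus)
merged = [merge_pair(s, pair, 100) for s in corpus]
```

The resulting pseudo-subword sequences serve as decoder targets: the encoder-decoder model is pre-trained to "transcribe" raw audio into them, which is the pseudo speech recognition task.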
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 9.8 | 966 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER 9.5 | 411 |
| Speech Translation | CoVoST-2 (test) | Avg BLEU (15 Dir) 24.4 | 46 |
| Speech-to-text Translation | CoVoST-2 low-resource X-to-En (test) | BLEU (Avg) 5.3 | 24 |
| Speech-to-text Translation | CoVoST-2 high-resource X-to-En (test) | Quality Score (Fr) 33.2 | 8 |
| Spoken Named Entity Recognition | SLUE-VoxPopuli (dev) | F1 Score 0.717 | 7 |
| Spoken Named Entity Recognition | SLUE-VoxPopuli (test) | F1 Score 65.4 | 6 |