Self-Training for End-to-End Speech Translation
About
One of the main challenges for end-to-end speech translation is data scarcity. We leverage pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech translation model. This provides 8.3 and 5.7 BLEU gains over a strong semi-supervised baseline on the MuST-C English-French and English-German datasets, reaching state-of-the-art performance. We also investigate how the quality of the pseudo-labels affects performance, and show that our approach is more effective than simply pre-training the encoder on the speech recognition task. Finally, we demonstrate the effectiveness of self-training by generating pseudo-labels directly with an end-to-end model instead of a cascade model.
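The core self-training recipe described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `teacher_translate` is a hypothetical stub standing in for either a cascade (ASR followed by MT) or an end-to-end speech translation teacher, and the dataset is represented as simple (audio, translation) pairs.

```python
# Sketch of self-training via pseudo-labeling (illustrative, not the
# authors' code): a teacher model labels unlabeled audio, and the
# resulting pseudo-labeled pairs are pooled with the gold data to
# train the final end-to-end student model.

def teacher_translate(audio_id):
    """Hypothetical teacher: returns a translation for one audio clip.

    In the paper this is either a cascade (ASR -> MT) or an
    end-to-end speech translation model.
    """
    return f"translation-of-{audio_id}"

def build_training_set(labeled, unlabeled_audio, teacher=teacher_translate):
    """Augment gold (audio, translation) pairs with pseudo-labeled ones."""
    pseudo_labeled = [(audio, teacher(audio)) for audio in unlabeled_audio]
    return labeled + pseudo_labeled

# Toy data: one gold pair plus two unlabeled clips.
labeled = [("clip-001", "bonjour le monde")]
unlabeled = ["clip-002", "clip-003"]

train_set = build_training_set(labeled, unlabeled)
# train_set now holds 1 gold pair and 2 pseudo-labeled pairs;
# the student model would be trained on all 3.
```

The student trained on this pooled set can itself serve as the teacher in a later round, which is how self-training with an end-to-end model (rather than a cascade) proceeds.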
Juan Pino, Qiantong Xu, Xutai Ma, Mohammad Javad Dousti, Yun Tang • 2020
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Translation | MuST-C EN-DE (test-COMMON) | 25.2 BLEU | 41 |
| Speech Translation | MuST-C EN-FR (test-COMMON) | 34.5 BLEU | 17 |
| Speech-to-text Translation | MuST-C En-X (tst-COM) | 25.2 BLEU (German) | 16 |