ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
About
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the broadening interests of the spoken language translation community. ESPnet-ST-v2 supports 1) offline speech-to-text translation (ST), 2) simultaneous speech-to-text translation (SST), and 3) offline speech-to-speech translation (S2ST). Each task is supported with a wide variety of approaches, differentiating ESPnet-ST-v2 from other open-source spoken language translation toolkits. This toolkit offers state-of-the-art architectures such as transducers, hybrid CTC/attention, multi-decoders with searchable intermediates, time-synchronous blockwise CTC/attention, Translatotron models, and direct discrete unit models. In this paper, we describe the overall design, example models for each task, and performance benchmarking behind ESPnet-ST-v2, which is publicly available at https://github.com/espnet/espnet.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Simultaneous Speech Translation | MuST-C EN-DE (tst-COMMON) | BLEU | 23.5 | 39 |
| Speech-to-speech translation | CVSS-C Es-En v1 (test) | ASR-BLEU | 32 | 8 |
| Streaming Speech-to-Text Translation | IWSLT En-De tst-COMMON 2022 v2 (test) | BLEU | 26.6 | 5 |
| Offline Speech Translation | MuST-C v1 (test) | BLEU (DE) | 27.9 | 4 |
| Simultaneous Speech Translation | MuST-C v1 (test) | BLEU (DE) | 23.5 | 2 |
| Offline Speech-to-Speech Translation | CVSS-C v1 (test) | ASR-BLEU (DE) | 23.7 | 2 |