Direct speech-to-speech translation with a sequence-to-sequence model
About
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech-to-speech translation | Fisher Spanish-English (test) | BLEU (Speech Input) | 46.3 | 55 |
| Speech-to-speech translation | Fisher Spanish-English (dev) | BLEU (Speech) | 45.5 | 48 |
| Speech-to-speech translation | Fisher Spanish-English (dev2) | ASR BLEU | 47.6 | 36 |
| Speech-to-speech translation | CVSS-C ES → EN (test) | ASR-BLEU | 8.72 | 16 |
| Speech-to-speech translation | CVSS-C DE → EN (test) | ASR-BLEU | 1.97 | 16 |
| Offline Speech-to-Speech Translation | CVSS-C (test) | Fr-En ASR-BLEU | 16.96 | 11 |
| S2ST Metric Evaluation | S2ST es→en (test) | Pearson Correlation | 0.3226 | 8 |
| S2ST Metric Evaluation | S2ST ru→en (test) | Pearson Correlation | 0.1588 | 8 |
| S2ST Metric Evaluation | S2ST hk→en (test) | Pearson Correlation | 0.2863 | 8 |
| S2ST Metric Evaluation | S2ST fr→en (test) | Pearson Correlation | 0.3277 | 8 |
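
Several rows above report ASR-BLEU. This metric is commonly computed by transcribing the translated speech with a speech recognizer and scoring the transcripts against reference translations with BLEU. The sketch below illustrates that pipeline under stated assumptions: `transcribe` is a hypothetical callable standing in for any ASR system, tokenization is plain lowercased whitespace splitting, and the BLEU function is a simplified, unsmoothed corpus-level variant with one reference per utterance. It is not the scoring setup behind the numbers in the table, which may differ in ASR model, tokenization, and BLEU implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified, unsmoothed corpus-level BLEU on a 0-100 scale.

    Assumes exactly one reference string per hypothesis string.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.lower().split(), ref.lower().split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clipped matches: an n-gram is credited at most as many
            # times as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


def asr_bleu(translated_audio, reference_texts, transcribe):
    """Transcribe each translated utterance, then score the transcripts.

    `transcribe` is a hypothetical ASR callable: audio -> text.
    """
    hypotheses = [transcribe(audio) for audio in translated_audio]
    return corpus_bleu(hypotheses, reference_texts)
```

Because the score depends on the recognizer as well as the translation model, ASR-BLEU values are only comparable when the same ASR system and text normalization are used.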