
RosettaSpeech: Zero-Shot Speech-to-Speech Translation without Parallel Speech

About

End-to-end speech-to-speech translation (S2ST) systems typically struggle with a critical data bottleneck: the scarcity of parallel speech-to-speech corpora. To overcome this, we introduce RosettaSpeech, a novel zero-shot framework trained exclusively on monolingual speech-text data augmented by machine translation supervision. Unlike prior works that rely on complex cascaded pseudo-labeling, our approach uses text as a semantic bridge during training to synthesize translation targets, eliminating the need for parallel speech pairs while keeping inference direct and end-to-end. Empirical evaluations on the CVSS-C benchmark demonstrate that RosettaSpeech achieves state-of-the-art zero-shot performance, surpassing leading baselines by significant margins, with ASR-BLEU scores of 25.17 for German-to-English (+27% relative gain) and 29.86 for Spanish-to-English (+14% relative gain). Crucially, our model preserves the source speaker's voice without ever seeing paired speech data. We further analyze the impact of data scaling and demonstrate the model's capability in many-to-one translation, offering a scalable solution for extending high-quality S2ST to "text-rich, speech-poor" languages.
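The relative gains quoted above can be turned into approximate baseline scores with a little arithmetic. The sketch below is illustrative only. It assumes each percentage is measured against a single baseline ASR-BLEU score and that the percentages are rounded as printed, so the implied baselines are approximate. The function name `implied_baseline` is hypothetical and not taken from the paper.

```python
def implied_baseline(score: float, relative_gain_pct: float) -> float:
    """Back out a baseline b from score = b * (1 + gain / 100)."""
    return score / (1.0 + relative_gain_pct / 100.0)


# (ASR-BLEU score, relative gain in percent) as quoted in the abstract.
reported = {
    "De-En": (25.17, 27.0),
    "Es-En": (29.86, 14.0),
}

for pair, (score, gain) in reported.items():
    base = implied_baseline(score, gain)
    # The gain percentages are rounded, so treat these as rough estimates.
    print(f"{pair}: implied baseline ~ {base:.2f}, absolute gain ~ {score - base:.2f}")
```

Because the abstract reports only rounded percentages, the baselines recovered this way should be read as estimates, not as figures reported by the authors.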

Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain, Zhu Liu, Sunil Hadap, Vimal Bhat, Manoj Aggarwal, Gerard Medioni, David Harwath • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech-to-speech translation | CVSS-C ES → EN (test) | ASR-BLEU: 33.05 | 16 |
| Speech-to-speech translation | CVSS-C DE → EN (test) | ASR-BLEU: 29.9 | 16 |
| Speech-to-speech translation | CVSS-C Fr→En | ASR-BLEU: 32.16 | 11 |
| Speech-to-speech translation | CVSS-C De→En | ASR-BLEU: 21.54 | 10 |
| Speech-to-speech translation | CVSS-C Es→En | ASR-BLEU: 29.35 | 10 |
| Speech-to-speech translation | CVSS-C Average | ASR-BLEU: 27.68 | 10 |
| Speech-to-speech translation | CVSS-C | Fr→En BLASER 2.0 Score: 4.04 | 7 |
| Speech-to-text translation | CVSS-C Fr→En | COMET Score: 78.97 | 5 |
| Speech-to-text translation | CVSS-C De→En | COMET Score: 79.65 | 5 |
| Speech-to-text translation | CVSS-C Es→En | COMET Score: 82.05 | 5 |
Showing 10 of 11 rows
