PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs
About
Although Large Language Models (LLMs) excel in many tasks, their application to Speech-to-Speech Translation (S2ST) is underexplored and hindered by data scarcity. To bridge this gap, we propose PROST-LLM (PROgressive Speech-to-speech Translation), which progressively enhances the S2ST capabilities of LLMs. First, we fine-tune the LLMs on the CVSS corpus, using purpose-designed tri-task learning and chain-of-modality methods to boost initial performance. Then, using the fine-tuned model, we generate preference pairs through self-sampling and back-translation, without human evaluation. Finally, we use these preference pairs for preference optimization to further enhance the model's S2ST capability. Extensive experiments confirm that PROST-LLM improves the S2ST capability of LLMs.
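The preference-pair step can be illustrated with a minimal sketch. It is not the authors' implementation: it works on text strings rather than speech, and the names `token_f1` and `build_preference_pair`, the token-overlap score, and the toy back-translation table are all assumptions made for illustration. The idea shown is only that sampled candidates are ranked by how closely their back-translation matches the source, with the best kept as "chosen" and the worst as "rejected".

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def token_f1(reference: str, hypothesis: str) -> float:
    """Token-overlap F1, a simple stand-in for a real translation metric."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_preference_pair(
    source: str,
    candidates: Sequence[str],
    back_translate: Callable[[str], str],
    score: Callable[[str, str], float] = token_f1,
) -> Tuple[str, str]:
    """Return (chosen, rejected) from candidates sampled by the model itself.

    Each candidate is translated back into the source language and scored
    against the original source; no human judgment is involved.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: score(source, back_translate(c)))
    return ranked[-1], ranked[0]


# Hypothetical back-translation table standing in for a real model call.
TOY_BACK_TRANSLATIONS = {
    "le chat dort": "the cat sleeps",
    "le chien dort": "the dog sleeps",
    "un chat": "a cat",
}

chosen, rejected = build_preference_pair(
    "the cat sleeps",
    list(TOY_BACK_TRANSLATIONS),
    TOY_BACK_TRANSLATIONS.__getitem__,
)
```

In a real pipeline the candidates would be the fine-tuned model's own sampled outputs, and the resulting pairs would feed a preference-optimization objective.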
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speech Naturalness Assessment | CVSS-C | UTMOS | 3.74 | 10 |
| Speech Naturalness Assessment | CVSS-T | UTMOS | 3.67 | 10 |
| Speech-to-Speech Translation (en2fra) | CVSS-C | BLEU | 25.12 | 10 |
| Speech-to-Speech Translation (en2fra) | CVSS-T | BLEU | 22.54 | 10 |
| Speech-to-Speech Translation (fra2en) | CVSS-C | BLEU | 0.2178 | 10 |
| Speech-to-Speech Translation (fra2en) | CVSS-T | BLEU | 19.49 | 10 |
| Speech-to-Text Translation (en2fra) | CVSS-C | BLEU | 29.97 | 7 |
| Speech-to-Text Translation (en2fra) | CVSS-T | BLEU | 26.13 | 7 |
| Speech-to-Text Translation (fra2en) | CVSS-C | BLEU | 23.04 | 7 |
| Speech-to-Text Translation (fra2en) | CVSS-T | BLEU | 20.94 | 7 |