S2ST-Omni: Hierarchical Language-Aware SpeechLLM Adaptation for Multilingual Speech-to-Speech Translation
About
Despite recent advances in speech-to-speech translation (S2ST), it remains difficult to achieve both high translation accuracy and practical flexibility. In this paper, we present S2ST-Omni, a compositional S2ST framework that integrates a high-accuracy speech-to-text translation (S2TT) frontend with a modular, plug-and-play text-to-speech (TTS) backend, enabling independent optimization of translation and synthesis. On the S2TT side, we introduce a hybrid adapter that follows a "local-then-global" strategy to bridge a pretrained Whisper encoder and a Qwen3 LLM, yielding a hierarchical acoustic-to-semantic abstraction. Building on this bridge, we further propose a hierarchical language-aware architecture that injects source-language information at two complementary levels. At the acoustic level, Language-Aware Dual-CTC operates on intermediate adapter features and employs FiLM-style feature modulation with a learnable gate, encouraging the model to learn language-specific but content-faithful acoustic representations. At the linguistic level, Language-Aware Prompting dynamically constructs source-language-conditioned prompts that activate language-specific translation knowledge in the LLM. To enable efficient optimization, we design a task-specific progressive fine-tuning strategy that first stabilizes speech-text alignment and then improves translation via LoRA on top of this converged foundation. The TTS backend remains fully modular and can be instantiated with any state-of-the-art synthesizer without retraining the S2TT frontend. Experiments on CVSS-C show that S2ST-Omni consistently achieves the best BLEU and ASR-BLEU across the French-, German-, and Spanish-to-English directions, outperforming strong recent S2ST baselines.
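
The abstract mentions FiLM-style feature modulation with a learnable gate but does not give its exact form. The following is a minimal sketch of one common formulation, not the authors' implementation: a source-language embedding is projected to a per-channel scale and shift, and a sigmoid gate interpolates between the original and the modulated features. The names `gated_film`, `w_gamma`, `w_beta`, and `gate_logit`, the identity-centered scale, and the interpolation form are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(vec, weight):
    """Project vec (length E) with weight (E rows, D columns) to length D."""
    return [
        sum(v * row[d] for v, row in zip(vec, weight))
        for d in range(len(weight[0]))
    ]


def gated_film(features, lang_embedding, w_gamma, w_beta, gate_logit):
    """Hypothetical gated FiLM layer (assumed form, not taken from the paper).

    features:       T frames, each a list of D channel values
    lang_embedding: source-language embedding of length E
    w_gamma/w_beta: E x D projection matrices for scale and shift
    gate_logit:     scalar; sigmoid(gate_logit) is the learnable gate
    """
    # Scale is centered at 1 so that zero weights leave features unchanged.
    gamma = [1.0 + g for g in linear(lang_embedding, w_gamma)]
    beta = linear(lang_embedding, w_beta)
    gate = sigmoid(gate_logit)
    # Interpolate between the raw features and the FiLM-modulated features.
    return [
        [(1.0 - gate) * x + gate * (g * x + b) for x, g, b in zip(frame, gamma, beta)]
        for frame in features
    ]


# Toy example: 2 frames, 2 channels, 2-dimensional language embedding.
features = [[1.0, 2.0], [3.0, 4.0]]
lang_embedding = [1.0, 0.0]
w_gamma = [[0.5, 0.0], [0.0, 0.0]]
w_beta = [[0.0, 1.0], [0.0, 0.0]]
out = gated_film(features, lang_embedding, w_gamma, w_beta, gate_logit=0.0)
print(out)  # [[1.25, 2.5], [3.75, 4.5]]
```

With the gate near zero the layer passes features through almost unchanged. This is one plausible reading of how a gate could keep representations content-faithful while still allowing language-specific modulation.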
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech-to-speech translation | CVSS-C Fr→En | ASR-BLEU | 33.2 | 11 |
| Speech-to-speech translation | CVSS-C De→En | ASR-BLEU | 31.25 | 10 |
| Speech-to-speech translation | CVSS-C Es→En | ASR-BLEU | 35.9 | 10 |
| Speech-to-speech translation | CVSS-C Average | ASR-BLEU | 33.45 | 10 |
| Speech-to-speech translation | CVSS-C Fr→En | BLASER 2.0 | 4.12 | 7 |
| Speech-to-text translation | CVSS-C Fr→En | COMET | 81.94 | 5 |
| Speech-to-text translation | CVSS-C De→En | COMET | 80.73 | 5 |
| Speech-to-text translation | CVSS-C Es→En | COMET | 83.39 | 5 |
| Speech-to-text translation | CVSS-C Average | COMET | 82.02 | 5 |
| Speech-to-speech translation | CVSS-C Japanese-to-English | BLEU | 19.61 | 2 |
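
The "CVSS-C Average" rows appear to be unweighted means over the three directions (Fr→En, De→En, Es→En). The page does not state this, so it is an assumption, but the listed numbers are consistent with it. The short check below recomputes both averages from the per-direction scores in the table; `macro_average` is a helper name introduced here for illustration.

```python
# Per-direction scores copied from the benchmark table above.
asr_bleu = {"Fr→En": 33.2, "De→En": 31.25, "Es→En": 35.9}
comet = {"Fr→En": 81.94, "De→En": 80.73, "Es→En": 83.39}


def macro_average(scores):
    """Unweighted mean over language directions, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


print(macro_average(asr_bleu))  # 33.45
print(macro_average(comet))     # 82.02
```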