From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation
About
Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.
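The contrast between a flat language label and a structured typological prior can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the abstract does not specify the hierarchy levels, the typological features, or how they are combined, so the family/genus/language split, the additive composition, and all names (`TYPOLOGY`, `flat_embedding`, `hierarchical_embedding`) are hypothetical and should not be read as the S2ST-Omni 2 implementation.

```python
import random

# Hypothetical hierarchy: language -> (family, genus).
# Illustrative assumption only; the paper's actual typological features are not given here.
TYPOLOGY = {
    "fr": ("indo-european", "romance"),
    "es": ("indo-european", "romance"),
    "de": ("indo-european", "germanic"),
    "ja": ("japonic", "japanese"),
}

DIM = 8
rng = random.Random(0)


def new_vector():
    """Random stand-in for a learned embedding vector."""
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


# One embedding table per level of the (assumed) hierarchy.
family_emb = {f: new_vector() for f in sorted({f for f, _ in TYPOLOGY.values()})}
genus_emb = {g: new_vector() for g in sorted({g for _, g in TYPOLOGY.values()})}
lang_emb = {lang: new_vector() for lang in sorted(TYPOLOGY)}


def flat_embedding(lang):
    """Language-as-label: each language has an independent vector."""
    return lang_emb[lang]


def hierarchical_embedding(lang):
    """Sum of family, genus, and language vectors (one possible composition)."""
    family, genus = TYPOLOGY[lang]
    return [
        f + g + l
        for f, g, l in zip(family_emb[family], genus_emb[genus], lang_emb[lang])
    ]


def shared_levels(a, b):
    """Names of hierarchy levels whose parameters two languages share."""
    fa, ga = TYPOLOGY[a]
    fb, gb = TYPOLOGY[b]
    return [name for name, same in (("family", fa == fb), ("genus", ga == gb)) if same]


if __name__ == "__main__":
    for pair in (("fr", "es"), ("fr", "de"), ("fr", "ja")):
        print(pair, "share:", shared_levels(*pair))
```

In this toy setup French and Spanish share both the family and genus vectors, so updates driven by one language also move the representation of the other; with flat labels no parameters are shared. This is the kind of inductive bias the abstract argues is useful when supervised data are scarce.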
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech-to-speech translation | CVSS-C Fr→En | ASR-BLEU | 34.72 | 11 |
| Speech-to-speech translation | CVSS-C De→En | ASR-BLEU | 33.16 | 10 |
| Speech-to-speech translation | CVSS-C Es→En | ASR-BLEU | 37.13 | 10 |
| Speech-to-speech translation | CVSS-C Average | ASR-BLEU | 35 | 10 |
| Speech-to-speech translation | CVSS-C Fr→En | BLASER 2.0 | 4.21 | 7 |
| Speech-to-text translation | CVSS-C Fr→En | COMET | 82.74 | 5 |
| Speech-to-text translation | CVSS-C De→En | COMET | 82.16 | 5 |
| Speech-to-text translation | CVSS-C Es→En | COMET | 85.03 | 5 |
| Speech-to-text translation | CVSS-C Average | COMET | 83.31 | 5 |
| Speech-to-speech translation | CVSS-C Ja→En | BLEU | 22 | 2 |
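As a quick consistency check on the table, the sketch below recomputes the "Average" rows from the per-direction scores. It assumes the averages are unweighted means over the three listed directions (Fr→En, De→En, Es→En); the page does not state how the averages were computed, so that is an assumption, and the variable and function names are illustrative.

```python
from statistics import mean

# Per-direction scores copied from the benchmark table.
asr_bleu = {"Fr-En": 34.72, "De-En": 33.16, "Es-En": 37.13}
comet = {"Fr-En": 82.74, "De-En": 82.16, "Es-En": 85.03}


def unweighted_average(scores):
    """Unweighted mean over language directions, rounded to two decimals."""
    return round(mean(scores.values()), 2)


if __name__ == "__main__":
    print("ASR-BLEU average:", unweighted_average(asr_bleu))
    print("COMET average:", unweighted_average(comet))
```

Under that assumption the recomputed means come out at roughly 35.00 for ASR-BLEU and 83.31 for COMET, which agrees with the reported average rows.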