From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

About

Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.
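The contrast between flat language labels and structured typological priors can be illustrated with a minimal sketch. It is not the S2ST-Omni 2 implementation: the attribute set (family and branch), the additive composition, the residual scale, and every name below are assumptions made for illustration, and random vectors stand in for learned embedding tables.

```python
import numpy as np

# Hypothetical typological attributes, chosen only for illustration;
# the abstract does not list the features S2ST-Omni 2 actually uses.
TYPOLOGY = {
    "fr": {"family": "Indo-European", "branch": "Romance"},
    "es": {"family": "Indo-European", "branch": "Romance"},
    "de": {"family": "Indo-European", "branch": "Germanic"},
    "ja": {"family": "Japonic", "branch": "Japonic"},
}

DIM = 8
rng = np.random.default_rng(0)


def make_table(keys):
    """One random vector per key, standing in for a learned embedding table."""
    return {k: rng.normal(size=DIM) for k in sorted(keys)}


# Flat "language-as-label" table: one independent vector per language.
flat = make_table(TYPOLOGY)

# Structured tables: a vector is shared by every language with that attribute value.
family_emb = make_table({t["family"] for t in TYPOLOGY.values()})
branch_emb = make_table({t["branch"] for t in TYPOLOGY.values()})
# A small language-specific residual keeps languages in one branch distinguishable.
residual = {k: 0.1 * v for k, v in make_table(TYPOLOGY).items()}


def structured_embedding(lang):
    """Compose a language vector from shared attribute vectors plus a residual."""
    t = TYPOLOGY[lang]
    return family_emb[t["family"]] + branch_emb[t["branch"]] + residual[lang]


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_fr_es = cosine(structured_embedding("fr"), structured_embedding("es"))
sim_fr_ja = cosine(structured_embedding("fr"), structured_embedding("ja"))
print(f"structured fr-es: {sim_fr_es:.3f}, fr-ja: {sim_fr_ja:.3f}")
print(f"flat       fr-es: {cosine(flat['fr'], flat['es']):.3f}")
```

In this toy setup French and Spanish share both attribute vectors, so their structured embeddings are nearly parallel, while the flat table gives them unrelated vectors. That sharing is the kind of inductive bias the abstract argues can help when supervised data for a language are scarce.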

Yu Pan, Yang Hou, Xiongfei Wu, Liang Zhang, Yves Le Traon, Lei Ma, Jianjun Zhao • 2026

Related benchmarks

Task	Dataset	Metric	Result
Speech-to-speech translation	CVSS-C Fr→En	ASR-BLEU	34.72	11
Speech-to-speech translation	CVSS-C De→En	ASR-BLEU	33.16	10
Speech-to-speech translation	CVSS-C Es→En	ASR-BLEU	37.13	10
Speech-to-speech translation	CVSS-C Average	ASR-BLEU	35	10
Speech-to-speech translation	CVSS-C Fr→En	BLASER 2.0 Score	4.21	7
Speech-to-text translation	CVSS-C Fr→En	COMET Score	82.74	5
Speech-to-text translation	CVSS-C De→En	COMET Score	82.16	5
Speech-to-text translation	CVSS-C Es→En	COMET Score	85.03	5
Speech-to-text translation	CVSS-C Average	COMET Score	83.31	5
Speech-to-speech translation	CVSS-C Japanese-to-English	BLEU	22	2
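As a quick consistency check on the table above, the "Average" rows can be compared against the per-direction scores. The sketch assumes that each average is an unweighted mean over Fr→En, De→En, and Es→En, which this page does not state explicitly; the dictionary names are illustrative.

```python
# Per-direction scores copied from the benchmark table.
asr_bleu = {"Fr→En": 34.72, "De→En": 33.16, "Es→En": 37.13}
comet = {"Fr→En": 82.74, "De→En": 82.16, "Es→En": 85.03}


def mean(scores):
    """Unweighted mean over translation directions (an assumption)."""
    return sum(scores.values()) / len(scores)


print(f"ASR-BLEU mean: {mean(asr_bleu):.2f}")  # listed average: 35
print(f"COMET mean:    {mean(comet):.2f}")     # listed average: 83.31
```

Under that assumption both computed means agree with the listed averages to two decimal places.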
