POTSA: A Cross-Lingual Speech Alignment Framework for Speech-to-Text Translation
About
Speech Large Language Models have achieved breakthroughs in multilingual speech-to-text translation. However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport, designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations. Second, we impose token-level OT constraints on a Q-Former using parallel pairs to establish fine-grained representation consistency. Then, we apply a layer scheduling strategy to focus OT constraints on semantically beneficial layers. Experiments on FLEURS show our method achieves SOTA performance, with +1.29 BLEU over five common languages and +2.93 BLEU on zero-shot languages, using only 10 hours of parallel speech per language.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Speech-to-text Translation | FLEURS supervised training languages (test) | BLEU (en→zh)40.87 | 9 | |
| Speech-to-text Translation | FLEURS zero-shot evaluation languages (test) | Average BLEU20.84 | 8 |