
Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

About

Code-switching (CS) speech translation (ST) aims to translate speech that alternates between multiple languages into a target language text, posing significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies mainly rely on the models themselves to implicitly learn semantic representations and resort to costly manual annotations. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture-of-Experts (MoE) speech projector composed of language expert groups, where each group specializes in the semantic space of a specific language for fine-grained speech feature modeling. A language-specific loss and an intra-group load balancing loss are jointly introduced to guide efficient token routing across and within expert groups. Furthermore, we introduce a multi-stage training paradigm that utilizes readily available automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation performance. To bridge the data gap for smooth domain transfer, a transition loss is employed to improve adaptation to CS scenarios. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach, achieving average improvements of $0.86$ BLEU and $0.93$ COMET over SeamlessM4T, with maximum improvements of $1.49$ BLEU and $1.41$ COMET across different test sets.
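The routing idea in the abstract (tokens routed across language expert groups, then within a group, guided by a language-specific loss and an intra-group load balancing loss) can be illustrated with a minimal pure-Python sketch. The abstract does not give the formulas, so everything here is an assumption for illustration only: top-1 routing at both levels, cross-entropy against per-token language labels as the language-specific loss, and a Switch-Transformer-style balancing term inside one group. All function and variable names are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def route_token(x, group_router, expert_routers):
    """Assumed two-level top-1 routing: pick a language group, then an expert in it."""
    group_probs = softmax([dot(w, x) for w in group_router])
    g = max(range(len(group_probs)), key=group_probs.__getitem__)
    expert_probs = softmax([dot(w, x) for w in expert_routers[g]])
    e = max(range(len(expert_probs)), key=expert_probs.__getitem__)
    return group_probs, g, expert_probs, e


def language_loss(group_probs_per_token, lang_labels):
    """Assumed form: cross-entropy between group-router probabilities and language labels."""
    eps = 1e-9
    total = sum(math.log(p[y] + eps) for p, y in zip(group_probs_per_token, lang_labels))
    return -total / len(lang_labels)


def load_balance_loss(expert_probs_per_token, chosen):
    """Assumed form (Switch-style) within one group: n * sum_i f_i * P_i."""
    n = len(expert_probs_per_token[0])
    t = len(chosen)
    frac = [sum(1 for c in chosen if c == i) / t for i in range(n)]
    mean_prob = [sum(p[i] for p in expert_probs_per_token) / t for i in range(n)]
    return n * sum(f * p for f, p in zip(frac, mean_prob))


if __name__ == "__main__":
    random.seed(0)
    dim, n_groups, n_experts = 4, 2, 2
    # Randomly initialised router weights; a real system would learn these.
    group_router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_groups)]
    expert_routers = [
        [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
        for _ in range(n_groups)
    ]
    tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
    labels = [i % n_groups for i in range(len(tokens))]  # toy language labels

    routed = [route_token(x, group_router, expert_routers) for x in tokens]
    print("language loss:", language_loss([r[0] for r in routed], labels))
    for g in range(n_groups):
        in_group = [r for r in routed if r[1] == g]
        if in_group:
            loss = load_balance_loss([r[2] for r in in_group], [r[3] for r in in_group])
            print("group", g, "balance loss:", loss)
```

Under these assumptions, the balancing term equals 1.0 when tokens and router probability are spread evenly over the experts of a group and grows as routing concentrates on fewer experts, which is why such a term is commonly added to discourage expert collapse.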

Yan Gao, Yazheng Yang, Zhibin Lan, Yidong Chen, Min Zhang, Daimeng Wei, Derek F. Wong, Jinsong Su • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Speech Translation | NTUML Code-Switching 2021 (test) | BLEU 39.52 | 18 |
| Speech Translation | Fisher Code-Switching (test) | BLEU 37.51 | 11 |
| Speech Translation | Fisher Monolingual (test) | BLEU 35.87 | 11 |
| Speech Translation | NTUML Monolingual 2021 (test) | BLEU 36.27 | 11 |
