Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

About

Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing deployment paradigms face critical challenges: pure on-device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks because they transmit raw voice data. Furthermore, most models exhibit English-centric biases, which limits scaling to many-to-many translation. In this paper, we propose Edge-cloud Speech Recognition and Translation (ESRT), a privacy-preserving and bandwidth-efficient collaborative edge-cloud MLLM framework. Specifically, we design an edge-cloud split inference architecture that retains a lightweight speech encoder and adapter on the device and transmits only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to $10\times$. To overcome English-centric bottlenecks, we introduce a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT-4B and ESRT-12B, achieve state-of-the-art many-to-many S2TT performance across 45 languages ($45 \times 44$ directions). Code and models are released at https://github.com/yxduir/esrt to facilitate reproducible, privacy-aware MLLM S2TT research.
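
The split described in the abstract (encoder and adapter on the device, only compact features sent to the cloud) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the ESRT code: the 16 kHz int16 input, the random projection standing in for a learned encoder and adapter, the int8 quantization, and the names `edge_encode` and `cloud_decode` are all hypothetical. The payload sizes it prints depend entirely on these placeholder choices and say nothing about the bandwidth figure reported in the paper.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate, not stated in the abstract


def edge_encode(waveform, frame_len=640, feature_dim=16, seed=0):
    """Hypothetical on-device step: frame the audio, project each frame to a
    small vector, and quantize to int8. A fixed random projection stands in
    for the learned speech encoder and adapter."""
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((frame_len, feature_dim)).astype(np.float32)
    projection /= np.sqrt(frame_len)
    features = frames.astype(np.float32) @ projection
    # Symmetric int8 quantization; fall back to scale 1.0 for silent input.
    scale = float(np.abs(features).max()) / 127.0 or 1.0
    quantized = np.round(features / scale).astype(np.int8)
    return quantized, scale


def cloud_decode(quantized, scale):
    """Hypothetical cloud step: recover approximate float features, which a
    cloud-side language model would then consume."""
    return quantized.astype(np.float32) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # One second of synthetic int16 audio in place of a real recording.
    waveform = rng.integers(-2000, 2000, size=SAMPLE_RATE).astype(np.int16)
    payload, scale = edge_encode(waveform)
    restored = cloud_decode(payload, scale)
    print("raw bytes:    ", waveform.nbytes)
    print("payload bytes:", payload.nbytes)
    print("restored shape:", restored.shape)
```

In this sketch only `payload` and `scale` would cross the network; the raw waveform never leaves the device function. Whether the transmitted features actually hide speaker identity depends on the real encoder, which this toy does not model.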

Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Ming Liu, Bing Qin, Yang Xiang • 2026

Related benchmarks

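| Task                       | Dataset                         | Metric                | Value | Rank |
|----------------------------|---------------------------------|-----------------------|-------|------|
| Machine Translation        | 45 x 44 translation directions  | Count (> 90)          | 8     | 6    |
| Speech-to-text Translation | FLEURS 11 x 44 directions       | ARA -> X Score        | 83.3  | 6    |
| Speech-to-text Translation | FLEURS 44 x 11 directions       | S2T Score (X→ara)     | 81    | 6    |
| Speech-to-text Translation | FLEURS                          | Error Rate (ara -> X) | 20.8  | 6    |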
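
The direction counts quoted on this page (45 x 44, 11 x 44, 44 x 11) are counts of ordered source-target pairs. A minimal sketch of that arithmetic follows, assuming the 11-language subset is drawn from the same 45 languages and that a language is never paired with itself. The placeholder codes `lang00` to `lang44` and the function names are hypothetical; they are not the FLEURS language codes or anything from the ESRT repository.

```python
from itertools import permutations


def all_directions(languages):
    """Every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


def directions_from(sources, languages):
    """Pairs whose source is in `sources` and whose target is any other language."""
    return [(s, t) for s in sources for t in languages if s != t]


def directions_into(targets, languages):
    """Pairs whose target is in `targets` and whose source is any other language."""
    return [(s, t) for t in targets for s in languages if s != t]


if __name__ == "__main__":
    languages = [f"lang{i:02d}" for i in range(45)]  # placeholder codes
    subset = languages[:11]                          # assumed 11-language subset
    print(len(all_directions(languages)))            # 45 * 44 = 1980
    print(len(directions_from(subset, languages)))   # 11 * 44 = 484
    print(len(directions_into(subset, languages)))   # 44 * 11 = 484
```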
