
Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects

About

Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at https://github.com/kalvinchang/yubao.
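The speech-to-speech retrieval evaluation mentioned above can be illustrated with a minimal sketch. Assuming (this is not stated in the abstract) that retrieval ranks candidates by cosine similarity between fixed-dimensional utterance embeddings, and that each dialect utterance has exactly one Mandarin counterpart at the same index, recall@k over such paired embeddings looks like this. All function names here are illustrative, not from the released code.

```python
import numpy as np

def retrieve_topk(queries: np.ndarray, gallery: np.ndarray, k: int = 1) -> np.ndarray:
    """For each query embedding, return indices of the top-k gallery
    embeddings ranked by cosine similarity (highest first)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                      # pairwise cosine similarities
    return np.argsort(-sims, axis=1)[:, :k]

def recall_at_k(queries: np.ndarray, gallery: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose paired item (same row index) appears
    in the top-k retrieved gallery items."""
    topk = retrieve_topk(queries, gallery, k)
    hits = [i in topk[i] for i in range(len(queries))]
    return sum(hits) / len(hits)
```

With semantically aligned representations, a dialect utterance and its Mandarin counterpart should embed close together, so recall@1 approaches 1.0; misaligned encoders score near chance (1/gallery size).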

Kalvin Chang, Yiwen Shao, Jiahong Li, Dong Yu • 2026

Related benchmarks

Task                          Dataset     Metric  Result  Rank
Automatic Speech Recognition  CV-yue      CER     7.44    10
Automatic Speech Recognition  KeSpeech    CER     5.45    10
Automatic Speech Recognition  MDCC        CER     6.28    4
Automatic Speech Recognition  Shanghai    CER     10.02   4
Automatic Speech Recognition  Shanghai 2  CER     7.36    4
Automatic Speech Recognition  Hangzhou    CER     4.95    4
Automatic Speech Recognition  Suzhou      CER     10.17   4
Automatic Speech Recognition  Chaoshan    CER     10.6    4
Automatic Speech Recognition  Hokkien     CER     21.34   4
Automatic Speech Recognition  Hokkien 2   CER     21      4

Showing 10 of 12 rows
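The CER (character error rate) figures in the table are, by standard definition, the character-level edit distance between the reference and the hypothesis transcript, divided by the reference length. A minimal sketch of that computation (the benchmark's exact normalization rules, e.g. punctuation handling, are not specified here):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance (substitutions,
    insertions, deletions) over characters, divided by reference length."""
    r, h = list(reference), list(hypothesis)
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)
```

Character-level scoring suits Chinese varieties, which are written without word boundaries; a CER of 7.44 on CV-yue means roughly 7 character errors per 100 reference characters.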
