SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages

About

Low-resource tonal languages are widely spoken yet remain underserved by modern speech technology. A key challenge is learning representations that are robust to nuisance variation such as speaker gender while remaining sensitive to the tones that distinguish lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. SITA uses staged multi-objective training: (i) a cross-gender contrastive objective encourages lexical consistency across speakers, while a tone-repulsive loss prevents tone collapse by explicitly separating same-word, different-tone realizations; and (ii) an auxiliary Connectionist Temporal Classification (CTC)-based ASR objective with distillation stabilizes recognition-relevant structure. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language for which off-the-shelf multilingual encoders fail to represent tone effectively. On a curated Hmong word corpus, SITA improves cross-gender lexical retrieval accuracy while maintaining usable ASR accuracy relative to an ASR-adapted XLS-R teacher. We further observe similar gains when transferring the same recipe to Mandarin, suggesting that SITA is a general, plug-in approach for adapting multilingual speech encoders to tonal languages.
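The abstract names two stage-(i) objectives but gives no formulas, so the following is only a rough sketch of what such losses could look like, not the authors' implementation. It assumes an InfoNCE-style contrastive loss over paired male/female embeddings of the same word and tone, and a hinge penalty on the cosine similarity of same-word, different-tone pairs. The function names, `temperature`, and `margin` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_gender_contrastive(male, female, temperature=0.1):
    """Assumed InfoNCE form: male[i] and female[i] share the same word and tone.

    Each male embedding should be closer to its female counterpart than to
    any other female embedding in the batch.
    """
    total = 0.0
    for i, anchor in enumerate(male):
        logits = [cosine(anchor, f) / temperature for f in female]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(male)


def tone_repulsive(embeddings, words, tones, margin=0.5):
    """Assumed hinge form: penalize same-word, different-tone pairs whose
    cosine similarity exceeds `margin` (a hypothetical hyperparameter)."""
    penalties = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if words[i] == words[j] and tones[i] != tones[j]:
                similarity = cosine(embeddings[i], embeddings[j])
                penalties.append(max(0.0, similarity - margin))
    return sum(penalties) / len(penalties) if penalties else 0.0


# Toy 2-D embeddings; real encoder outputs would be much higher-dimensional.
male = [[1.0, 0.0], [0.0, 1.0]]
female = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = cross_gender_contrastive(male, female)
swapped_loss = cross_gender_contrastive(male, female[::-1])

collapsed = tone_repulsive([[1.0, 0.0], [1.0, 0.0]], ["w1", "w1"], ["t1", "t2"])
separated = tone_repulsive([[1.0, 0.0], [0.0, 1.0]], ["w1", "w1"], ["t1", "t2"])
```

In this toy setup the contrastive loss is lower when the male and female embeddings are correctly paired than when the pairs are swapped, and the repulsive term is positive only when two tones of the same word sit on top of each other. How the terms are weighted and scheduled across stages is not stated in the abstract.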

Tianyi Xu, Xuan Ouyang, Binwei Yao, Shoua Xiong, Sara Misurelli, Maichou Lor, Junjie Hu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Cross-gender word retrieval | Hmong (M→F) cross-gender (unseen-speaker) | Top-1 Accuracy | 68.7 | 12 |
| Cross-gender Retrieval | Mandarin M→F | Top-1 | 99.27 | 8 |
| Tone Geometry | Mandarin | PosSim | 96.5 | 8 |
| Automatic Speech Recognition | Mandarin | CER | 0.73 | 7 |
| Cross-gender word retrieval | Hmong F→M (test) | Top-1 Accuracy | 62.86 | 6 |
| Word Retrieval | Tone Perfect Mandarin, Male→Female (test) | Top-1 Acc | 99.27 | 5 |
| Automatic Speech Recognition | Hmong word+tone (test) | CER | 0.1985 | 4 |
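For readers unfamiliar with the two metrics that recur in the benchmark rows, here is a minimal sketch of their standard definitions. The page does not say how the reported numbers were computed or whether CER is given as a fraction or a percentage, so the details below are assumptions: CER is total character edit distance over total reference characters, and top-1 retrieval accuracy uses cosine similarity between query and gallery embeddings. All function names are hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character error rate as a fraction: total edits / total reference chars."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top1_accuracy(queries, query_labels, gallery, gallery_labels):
    """Fraction of queries whose nearest gallery item carries the same label."""
    hits = 0
    for q, label in zip(queries, query_labels):
        scores = [cosine(q, g) for g in gallery]
        best = max(range(len(gallery)), key=scores.__getitem__)
        hits += gallery_labels[best] == label
    return hits / len(queries)


# Toy example: one substituted character out of four reference characters.
toy_cer = cer(["abcd"], ["abed"])

# Toy cross-gender retrieval: male queries against a female gallery.
toy_top1 = top1_accuracy(
    queries=[[1.0, 0.0], [0.0, 1.0]],
    query_labels=["w1", "w2"],
    gallery=[[0.9, 0.1], [0.2, 0.8]],
    gallery_labels=["w1", "w2"],
)
```

In a cross-gender setting such as M→F, the queries would come from speakers of one gender and the gallery from the other, so a hit requires the embedding to carry word identity rather than speaker characteristics.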
