
Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan

About

Recent advances in Speech Large Language Models (Speech-LLMs) have greatly enhanced multimodal interaction capabilities. However, applying them in low-resource, dialect-diverse settings remains challenging. Tibetan is a prime example: its data are severely scarce, and its major dialects (Ü-Tsang, Amdo, and Kham) differ phonetically. This paper proposes Ti-Audio, the first multi-dialectal end-to-end Speech-LLM for Tibetan. To align speech and text efficiently, we introduce a Dynamic Q-Former Adapter that extracts essential acoustic features from variable-length speech, ensuring stable cross-modal alignment even with limited data. At the data level, we leverage mutual assistance among related dialects to alleviate data scarcity and employ a temperature-based sampling strategy to maximize this synergy. Experimental results demonstrate that Ti-Audio achieves state-of-the-art performance on Tibetan benchmarks for automatic speech recognition and speech translation. Our work validates the effectiveness of cross-dialectal cooperation and provides a scalable paradigm for developing Speech-LLMs in low-resource scenarios.
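The abstract does not spell out the temperature-based sampling strategy. The sketch below shows a common formulation from multilingual training, in which dialect i is sampled with probability proportional to n_i^(1/T), where n_i is its number of training utterances. This is an assumption, not the paper's confirmed method. The function name, the utterance counts, and the temperature values are all hypothetical and chosen only for illustration.

```python
import random


def temperature_sampling_probs(counts, temperature):
    """Return sampling probabilities p_i proportional to n_i ** (1 / T).

    Assumed formulation (not taken from the paper): T = 1 reproduces
    proportional sampling, and larger T moves the distribution toward uniform.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {name: n ** (1.0 / temperature) for name, n in counts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical utterance counts per dialect (illustrative, not from the paper).
counts = {"Ü-Tsang": 8000, "Amdo": 1500, "Kham": 500}

proportional = temperature_sampling_probs(counts, temperature=1.0)
smoothed = temperature_sampling_probs(counts, temperature=5.0)

# Draw the dialect of each example in a small batch from the smoothed distribution.
rng = random.Random(0)
batch = rng.choices(list(smoothed), weights=list(smoothed.values()), k=8)
```

Under this formulation, raising T increases the share of the smaller dialects (Kham in the hypothetical counts) relative to proportional sampling, which is one way a model could see low-resource dialects more often during training.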

Jialing Wang, Yue Zhao, Yuhao Zhang, Jing Yu, Shaosai Li, Zhanchen Dai, Benyou Wang, Haizhou Li • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Speech Translation | Tibetan Dialects (Amdo, Kham, Ü-Tsang) | BLEU (Amdo): 20.59 | 6 |
| Machine Translation | Tibetan Dialects (Amdo, Kham, Ü-Tsang) | -- | 4 |
| Automatic Speech Recognition | Tibetan Dialects (Amdo, Kham, Ü-Tsang) | Amdo WER: 14.25 | 3 |
| Gender Recognition | Tibetan Speech | Precision: 99.6 | 3 |
| Speaker Emotion Recognition | Speaker Emotion Recognition (SER) (test) | Precision (Anger): 41.67 | 3 |
