
Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning

About

Recent advances in automatic speech recognition (ASR) have combined speech encoders with large language models (LLMs) through projection, forming Speech LLMs with strong performance. However, adapting them to new domains remains challenging, especially in low-resource settings where paired speech-text data is scarce. We propose a text-only fine-tuning strategy for Speech LLMs using unpaired target-domain text without requiring additional audio. To preserve speech-text alignment, we introduce a real-time evaluation mechanism during fine-tuning. This enables effective domain adaptation while maintaining source-domain performance. Experiments on LibriSpeech, SlideSpeech, and Medical datasets show that our method achieves competitive recognition performance, with minimal degradation compared to full audio-text fine-tuning. It also improves generalization to new domains without catastrophic forgetting, highlighting the potential of text-only fine-tuning for low-resource domain adaptation of ASR.

Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, Kai Yu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automated Speech Recognition | SlideSpeech Ag | WER 14.47 | 10 |
| Automated Speech Recognition | SlideSpeech MI | WER 0.137 | 10 |
| Automatic Speech Recognition | SlideSpeech target | WER 15.3 | 5 |
| Automated Speech Recognition | SlideSpeech An | WER 27.83 | 5 |
| Automatic Speech Recognition | DefinedAI Banking target domain (test) | WER 10.92 | 5 |
| Automatic Speech Recognition | DefinedAI Insurance target domain (test) | WER 9.79 | 5 |
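Every result in the benchmark list is a word error rate (WER). As a reminder of what that metric measures, here is a minimal Python sketch of the standard definition: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. The page does not say which scoring tool or text normalization produced the listed numbers, nor whether each row is a fraction or a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: (substitutions + deletions + insertions) / reference length.

    Illustrative helper (hypothetical name). It splits on whitespace and
    applies no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the patient has a fever", "the patient had fever")
    # One substitution plus one deletion over five reference words.
    print(f"WER = {wer:.2f} ({wer * 100:.0f}%)")
```

The function returns a fraction; multiplying by 100 gives the percentage form. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.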
