Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning
About
The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR): prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly enhance ASR performance on domain adaptation tasks. While LLM-based ASR can naturally incorporate additional text corpora by fine-tuning the LLM decoder, fine-tuning such an ASR system on text-only data without paired prompts may diminish the effectiveness of the domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with the proposed method achieves up to a 9% relative Word Error Rate (WER) reduction and up to an 18% relative Entity Error Rate (EER) reduction on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion further improves the EER by a relative 2-5%.
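The core idea of soft prompt fine-tuning is that the pretrained decoder weights stay frozen while a small set of trainable prompt embeddings, prepended to the input sequence, absorbs the domain-specific text knowledge. The paper's exact two-step procedure is not reproduced here; the following is a minimal toy sketch of the general prompt-tuning mechanism, with a hypothetical frozen linear "decoder" and a mean-squared-error loss standing in for the real LLM and its cross-entropy objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, prompt_len, seq_len = 8, 4, 6

# Frozen stand-in for the pretrained decoder (hypothetical toy model).
W = rng.normal(size=(d_model, d_model))

# Trainable soft prompt: the only parameters updated during text adaptation.
soft_prompt = rng.normal(scale=0.1, size=(prompt_len, d_model))

def forward(prompt, token_embeddings):
    # Prepend the soft prompt to the token embeddings, as in prompt tuning.
    x = np.concatenate([prompt, token_embeddings], axis=0)
    return x @ W  # frozen transformation

token_embeddings = rng.normal(size=(seq_len, d_model))
target = np.zeros((prompt_len + seq_len, d_model))  # toy regression target

def loss(prompt):
    out = forward(prompt, token_embeddings)
    return np.mean((out - target) ** 2)

initial_loss = loss(soft_prompt)
lr = 0.01
for _ in range(200):
    out = forward(soft_prompt, token_embeddings)
    grad_out = 2 * (out - target) / out.size
    # Gradient flows only into the soft prompt rows; W stays frozen.
    grad_prompt = (grad_out @ W.T)[:prompt_len]
    soft_prompt -= lr * grad_prompt
final_loss = loss(soft_prompt)
```

Because only `prompt_len * d_model` parameters are updated, the adaptation is cheap and leaves the base model intact for other domains.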
Related benchmarks
| Task | Dataset | WER | Rank |
|---|---|---|---|
| Automatic Speech Recognition | SlideSpeech MI | 0.1335 | 10 |
| Automatic Speech Recognition | SlideSpeech Ag | 14.23 | 10 |
| Automatic Speech Recognition | SlideSpeech An | 25.95 | 5 |
| Automatic Speech Recognition | DefinedAI Banking target domain (test) | 10.63 | 5 |
| Automatic Speech Recognition | DefinedAI Insurance target domain (test) | 9.68 | 5 |
| Automatic Speech Recognition | SlideSpeech target | 15.71 | 5 |
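The WER figures above are the standard edit-distance metric: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal dynamic-programming sketch (not the paper's scoring tooling):

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(r)][len(h)] / len(r)
```

Entity Error Rate (EER) is computed analogously but restricted to the domain-specific entity tokens, which is why text adaptation helps it more.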