
Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning

About

The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR): prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly enhance ASR performance on domain adaptation tasks. While LLM-based ASR can naturally incorporate additional text corpora by fine-tuning the LLM decoder, fine-tuning such an ASR system on text-only data without paired prompts may diminish the effectiveness of domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with our proposed method achieved up to a 9% relative Word Error Rate (WER) reduction and up to an 18% relative Entity Error Rate (EER) reduction on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion can further improve the EER by a relative 2-5%.

Yingyi Ma, Zhe Liu, Ozlem Kalinli • 2024
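The core idea of soft prompt tuning described in the abstract — prepending trainable prompt embeddings to the input while keeping the LLM decoder frozen — can be sketched as below. This is a minimal illustration, not the paper's implementation: the tiny transformer layer stands in for the LLM decoder, and all class and parameter names (`SoftPromptLM`, `n_prompt`, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftPromptLM(nn.Module):
    """Toy stand-in for a frozen LLM decoder with trainable soft prompts."""

    def __init__(self, vocab_size=100, d_model=32, n_prompt=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)
        # Freeze every LLM parameter created so far; only the soft prompt
        # (registered after this loop) will receive gradients.
        for p in self.parameters():
            p.requires_grad = False
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)

    def forward(self, token_ids):
        b = token_ids.size(0)
        x = self.tok_emb(token_ids)                        # (B, T, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompt, x], dim=1)                  # prepend soft prompt
        h = self.decoder(x)
        # Predict only over the text positions, dropping the prompt slots.
        return self.head(h[:, self.soft_prompt.size(0):])

model = SoftPromptLM()
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
tokens = torch.randint(0, 100, (2, 5))
logits = model(tokens)
```

During text-only adaptation, an optimizer would be given only `model.soft_prompt`, so domain knowledge is absorbed into the prompt embeddings while the backbone stays intact.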

Related benchmarks

Task                          Dataset                                    Result        Rank
Automated Speech Recognition  SlideSpeech MI                             WER 0.1335    10
Automated Speech Recognition  SlideSpeech Ag                             WER 14.23     10
Automated Speech Recognition  SlideSpeech An                             WER 25.95     5
Automatic Speech Recognition  DefinedAI Banking target domain (test)     WER 10.63     5
Automatic Speech Recognition  DefinedAI Insurance target domain (test)   WER 9.68      5
Automatic Speech Recognition  SlideSpeech target                         WER 15.71     5
