Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning

About

The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR). Prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art in ASR. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly improve ASR performance on domain adaptation tasks. LLM-based ASR can naturally incorporate more text corpora by fine-tuning the LLM decoder, but fine-tuning such an ASR system on text-only data without paired prompts may diminish the effectiveness of the domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with our proposed method achieves a relative Word Error Rate (WER) reduction of up to 9% and an Entity Error Rate (EER) reduction of up to 18% on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion can further improve the EER by a relative 2-5%.

Yingyi Ma, Zhe Liu, Ozlem Kalinli • 2024
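
The abstract reports its gains as relative reductions in error rate. As a reading aid, the sketch below shows how a word error rate and a relative reduction are conventionally computed. It is a minimal illustration and not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, it assumes whitespace tokenization with no text normalization, and the numbers in the usage lines are made up for illustration rather than taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Fraction of the baseline error rate removed by the adapted system."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - adapted) / baseline


# Illustrative values only (hypothetical, not results from the paper).
wer = word_error_rate("pay the insurance premium", "pay the assurance premium")
gain = relative_reduction(baseline=10.0, adapted=9.1)
print(f"WER: {wer:.2%}, relative reduction: {gain:.1%}")
```

Under this definition, moving from a WER of 10.0 to 9.1 is a reduction of 0.9 absolute points but about 9% relative, which is the sense in which the abstract states its improvements.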

Related benchmarks

Task                          Dataset                                   Result      Rank
Automatic Speech Recognition  SPGISpeech                                WER 11.1    24
Automated Speech Recognition  SlideSpeech MI                            WER 0.1335  10
Automatic Speech Recognition  SlideSpeech                               WER 16.4    10
Automated Speech Recognition  SlideSpeech Ag                            WER 14.23   10
Automatic Speech Recognition  CSJ (eval1)                               CER 21      7
Automatic Speech Recognition  CSJ (eval2)                               CER 19.9    7
Automated Speech Recognition  SlideSpeech An                            WER 25.95   5
Automatic Speech Recognition  DefinedAI Banking target domain (test)    WER 10.63   5
Automatic Speech Recognition  DefinedAI Insurance target domain (test)  WER 9.68    5
Automatic Speech Recognition  SlideSpeech target                        WER 15.71   5