Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning
About
The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR): prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art. Although LLMs are trained on extensive text corpora, high-quality domain-specific text data can still significantly enhance ASR performance on domain adaptation tasks. While LLM-based ASR can naturally incorporate additional text corpora by fine-tuning the LLM decoder, fine-tuning such an ASR system on text-only data without paired prompts may diminish the effectiveness of the domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with the proposed method achieves up to a 9% relative Word Error Rate (WER) reduction and up to an 18% relative Entity Error Rate (EER) reduction on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion further improves the EER by a relative 2-5%.
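The core idea of soft prompt fine-tuning is that the pretrained decoder weights stay frozen while a small set of trainable prompt embeddings, prepended to the input sequence, absorbs the domain-specific text knowledge. The paper's exact two-step procedure is not reproduced here; the following is a minimal toy sketch of the general prompt-tuning mechanism, with a hypothetical frozen linear "decoder" and a mean-squared-error loss standing in for the real LLM and its cross-entropy objective.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, prompt_len, seq_len = 8, 4, 6

# Frozen stand-in for the pretrained decoder (hypothetical toy model).
W = rng.normal(size=(d_model, d_model))

# Trainable soft prompt: the only parameters updated during text adaptation.
soft_prompt = rng.normal(scale=0.1, size=(prompt_len, d_model))

def forward(prompt, token_embeddings):
    # Prepend the soft prompt to the token embeddings, as in prompt tuning.
    x = np.concatenate([prompt, token_embeddings], axis=0)
    return x @ W  # frozen transformation

token_embeddings = rng.normal(size=(seq_len, d_model))
target = np.zeros((prompt_len + seq_len, d_model))  # toy regression target

def loss(prompt):
    out = forward(prompt, token_embeddings)
    return np.mean((out - target) ** 2)

initial_loss = loss(soft_prompt)
lr = 0.01
for _ in range(200):
    out = forward(soft_prompt, token_embeddings)
    grad_out = 2 * (out - target) / out.size
    # Gradient flows only into the soft prompt rows; W stays frozen.
    grad_prompt = (grad_out @ W.T)[:prompt_len]
    soft_prompt -= lr * grad_prompt
final_loss = loss(soft_prompt)
```

Because only `prompt_len * d_model` parameters are updated, the adaptation is cheap and leaves the base model intact for other domains.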
Related benchmarks
| Task | Dataset | WER | Rank |
|---|---|---|---|
| Automatic Speech Recognition | SlideSpeech MI | 0.1335 | 10 |
| Automatic Speech Recognition | SlideSpeech Ag | 14.23 | 10 |
| Automatic Speech Recognition | SlideSpeech An | 25.95 | 5 |
| Automatic Speech Recognition | DefinedAI Banking target domain (test) | 10.63 | 5 |
| Automatic Speech Recognition | DefinedAI Insurance target domain (test) | 9.68 | 5 |
| Automatic Speech Recognition | SlideSpeech target | 15.71 | 5 |
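The WER figures above are the standard edit-distance metric: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal dynamic-programming sketch (not the paper's scoring tooling):

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(r)][len(h)] / len(r)
```

Entity Error Rate (EER) is computed analogously but restricted to the domain-specific entity tokens, which is why text adaptation helps it more.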