Text-only adaptation in LLM-based ASR through text denoising

About

Adapting large language model (LLM)-based automatic speech recognition (ASR) systems to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between the speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that frames this process as a text denoising task. Our approach trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
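The denoising framing described above can be illustrated with a minimal sketch of how (noisy input, clean target) training pairs might be built from target-domain text. The abstract does not specify the noise model, so the word-level deletion and substitution used here are purely an assumption for illustration, and the names `corrupt`, `make_pairs`, `p_drop`, and `p_sub` are hypothetical rather than taken from the paper.

```python
import random


def corrupt(transcript, rng, p_drop=0.1, p_sub=0.1, vocab=None):
    """Return a noisy copy of `transcript`.

    ASSUMPTION: word-level deletion/substitution is a stand-in noise model;
    the paper's actual corruption procedure is not given in the abstract.
    """
    words = transcript.split()
    vocab = vocab or words  # substitution candidates (hypothetical default)
    noisy = []
    for word in words:
        r = rng.random()
        if r < p_drop:
            continue  # delete this word
        if r < p_drop + p_sub:
            noisy.append(rng.choice(vocab))  # replace with a random word
        else:
            noisy.append(word)  # keep the word unchanged
    return " ".join(noisy)


def make_pairs(transcripts, seed=0, **noise_kwargs):
    """Build denoising examples: noisy text as input, clean text as target."""
    rng = random.Random(seed)
    return [
        {"input": corrupt(t, rng, **noise_kwargs), "target": t}
        for t in transcripts
    ]


if __name__ == "__main__":
    # Hypothetical target-domain sentence, for illustration only.
    for pair in make_pairs(["please transfer funds to my savings account"]):
        print(pair)
```

Under this sketch, the LLM would be fine-tuned to map each `input` to its `target`, so that the only supervision comes from target-domain text.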

Andrés Carofilis, Sergio Burdisso, Esaú Villatoro-Tello, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke • 2026

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	SlideSpeech Ag	WER 14.21	10
Automatic Speech Recognition	SlideSpeech MI	WER 0.1343	10
Automatic Speech Recognition	SlideSpeech An	WER 25.32	5
Automatic Speech Recognition	DefinedAI Banking target domain (test)	WER 10.11	5
Automatic Speech Recognition	DefinedAI Insurance target domain (test)	WER 8.71	5
Automatic Speech Recognition	SlideSpeech target	WER 14.6	5
