Text-only adaptation in LLM-based ASR through text denoising

About

Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
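The denoising setup described above can be illustrated with a minimal sketch that builds (noisy input, clean target) training pairs from target-domain text. The abstract does not say how the noisy inputs are produced, so the character-level deletions, substitutions, and insertions below are an assumption made only for illustration, and the names `add_noise` and `make_denoising_pairs` are hypothetical rather than taken from the paper.

```python
import random
from typing import List, Optional, Tuple

# Assumption: a lowercase alphabet plus space, used only for this illustration.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def add_noise(text: str, rate: float = 0.1, seed: Optional[int] = None) -> str:
    """Corrupt text with random character deletions, substitutions, and insertions.

    Each character is corrupted with probability `rate`, split evenly across
    the three edit types. This noise model is an illustrative assumption.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < rate / 3:
            continue  # deletion: drop the character
        elif r < 2 * rate / 3:
            out.append(rng.choice(ALPHABET))  # substitution: replace the character
        elif r < rate:
            out.append(ch)
            out.append(rng.choice(ALPHABET))  # insertion: add a character after it
        else:
            out.append(ch)  # keep the character unchanged
    return "".join(out)


def make_denoising_pairs(
    transcripts: List[str], rate: float = 0.1, seed: int = 0
) -> List[Tuple[str, str]]:
    """Return (noisy_input, clean_target) pairs for text-only fine-tuning."""
    rng = random.Random(seed)
    return [(add_noise(t, rate, seed=rng.randrange(2**32)), t) for t in transcripts]


if __name__ == "__main__":
    clean = ["transfer funds to my savings account", "what is my policy deductible"]
    for noisy, target in make_denoising_pairs(clean, rate=0.2):
        print(repr(noisy), "->", repr(target))
```

In such a setup the LLM would be fine-tuned to map each noisy string back to its clean transcript, so the training objective resembles correcting imperfect projector output rather than plain language modeling on target-domain text.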

Sergio Burdisso, Esaú Villatoro-Tello, Andrés Carofilis, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke • 2026

Related benchmarks

Task | Dataset | Result | Rank
Automated Speech Recognition | SlideSpeech Ag | WER 14.21 | 10
Automated Speech Recognition | SlideSpeech MI | WER 0.1343 | 10
Automated Speech Recognition | SlideSpeech An | WER 25.32 | 5
Automatic Speech Recognition | DefinedAI Banking target domain (test) | WER 10.11 | 5
Automatic Speech Recognition | DefinedAI Insurance target domain (test) | WER 8.71 | 5
Automatic Speech Recognition | SlideSpeech target | WER 14.6 | 5
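
For readers unfamiliar with the metric, the sketch below shows the standard definition of word error rate (word-level edit distance divided by the number of reference words) and how a relative improvement between two WER values is computed. It is not the evaluation code behind the results listed here; the function names are hypothetical, the example numbers are made up, and text normalization (casing, punctuation) is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction of the adapted system over the baseline."""
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts and WER values, for illustration only.
    print(word_error_rate("the cat sat", "the dog sat"))
    print(relative_improvement(10.0, 8.0))
```

The function returns a fraction, so multiplying by 100 gives the percentage form in which WER is commonly reported.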
