
WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

About

Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by a relative 49.0% and outperforms all non-WhisTLE baselines in 100 of 112 scenarios. We also find that WhisTLE additively complements any combination of other domain adaptation approaches; we thus recommend the inclusion of WhisTLE during standard processes for adapting encoder-decoder ASR models.
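The abstract reports results as word error rate (WER) and as a relative reduction in WER. As a minimal sketch of how these two quantities are conventionally computed (not the authors' evaluation code), the Python below uses a word-level edit distance. The function names `word_error_rate` and `relative_reduction` and all example sentences and numbers are hypothetical illustrations, not values from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction by which the adapted WER improves on the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion in six words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    # Hypothetical WERs: dropping from 10.0 to 5.0 is a 50% relative reduction.
    print(f"Relative reduction: {relative_reduction(10.0, 5.0):.1%}")
```

A relative reduction is measured against the baseline WER rather than in absolute percentage points, so the same relative figure can correspond to very different absolute gains depending on how high the baseline is.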

Akshat Pandey, Karun Kumar, Raphael Tang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Recognition | EMNS | WER 4.9 | 68 |
| Speech Recognition | EmoV-DB | WER 4 | 68 |
| Speech Recognition | ST-AEDS | WER 2 | 68 |
| Speech Recognition | EABI | WER 2.2 | 64 |
| Automatic Speech Recognition | UIED | WER 7 | 4 |
| Speech Recognition | EMNS (Out-of-Domain) | WER 7.4 | 4 |
| Speech Recognition | EmoV-DB (Out-of-Domain) | WER 16.8 | 4 |
| Speech Recognition | UIED Out-of-Domain | WER 6.4 | 4 |
| Speech Recognition | ST-AEDS (Out-of-Domain) | WER 4.5 | 4 |