In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

About

Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.

Xulin Fan, Vishal Sunder, Samuel Thomas, Mark Hasegawa-Johnson, Brian Kingsbury, George Saon • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech Other | WER 3.65 | 123 |
| Automatic Speech Recognition | LibriSpeech Clean | WER 1.62 | 107 |
| Automatic Speech Recognition | VoxPopuli | WER 5.95 | 38 |
| Automatic Speech Recognition | AMI | WER 9.79 | 35 |
| Automatic Speech Recognition | Common Voice | WER 8.89 | 22 |
| Automatic Speech Recognition | MLS | WER 5.63 | 7 |
| Automatic Speech Recognition | TIMIT | WER 2.53 | 7 |
| Automatic Speech Recognition | Buckeye (BUCK) | WER 11.51 | 7 |
| Speech Recognition with Timestamps | LibriSpeech Clean | AAS 12.44 | 6 |
| Speech Recognition with Timestamps | LibriSpeech Other | AAS 16.36 | 6 |
Showing 10 of 16 rows
