In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

About

Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.

Xulin Fan, Vishal Sunder, Samuel Thomas, Mark Hasegawa-Johnson, Brian Kingsbury, George Saon • 2026

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech Other	WER 3.65	140
Automatic Speech Recognition	LibriSpeech Clean	WER 1.62	124
Automatic Speech Recognition	AMI	WER 9.79	46
Automatic Speech Recognition	VoxPopuli	WER 5.95	44
Automatic Speech Recognition	Common Voice	WER 8.89	22
Automatic Speech Recognition	MLS	WER 5.63	7
Automatic Speech Recognition	TIMIT	WER 2.53	7
Automatic Speech Recognition	Buckeye (BUCK)	WER 11.51	7
Speech Recognition with Timestamps	LibriSpeech Clean	AAS 12.44	6
Speech Recognition with Timestamps	LibriSpeech Other	AAS 16.36	6

Showing 10 of 16 rows
