VibeVoice-ASR Technical Report

About

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice. It targets two challenges in long-form audio (e.g., meetings, podcasts) that persist despite recent advances in short-form speech recognition: context fragmentation and multi-speaker complexity. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.

Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, Yutao Sun, Hangbo Bao, Weijiang Xu, Yi Zhu, Zehua Wang, Ting Song, Yan Xia, Zewen Chi, Shaohan Huang, Liang Wang, Chuang Ding, Shuai Wang, Xie Chen, Furu Wei • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speaker-attributed Automatic Speech Recognition | AISHELL-4 (test) | CER 22.3 | 18 |
| Speaker-attributed Automatic Speech Recognition | AliMeeting Far (test) | CER 34.67 | 14 |
| Multi-speaker Automatic Speech Recognition | AMI | CP-WER 30.51 | 11 |
| Speaker-attributed Automatic Speech Recognition | AMI SDM | WER 22.09 | 7 |
| Speaker-attributed Automatic Speech Recognition | AliMeeting | WER 26.75 | 7 |
| Speaker-attributed Automatic Speech Recognition | AISHELL-4 | WER 21.4 | 6 |
| Speaker-attributed Automatic Speech Recognition | Fisher (local setting) | DER 17.68 | 4 |
| Speaker-attributed Automatic Speech Recognition | Candor (local setting) | DER 30.89 | 4 |
| Speaker-attributed Automatic Speech Recognition | MLC Global Meeting-level | DER 19.83 | 4 |
| Speaker-attributed Automatic Speech Recognition | MLC local setting | DER 14.01 | 4 |
Showing 10 of 17 rows
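
For readers unfamiliar with the WER and CER figures listed above, the following is a minimal sketch of the standard definitions: the edit distance between reference and hypothesis, divided by the reference length, expressed as a percentage. It is not the scoring code used for these benchmarks. The function names are illustrative, and it assumes whitespace tokenization for WER and space-stripped characters for CER. Text normalization, the speaker-permutation step behind CP-WER, and DER are not covered.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate, assuming whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


if __name__ == "__main__":
    print(round(wer("the cat sat on the mat", "the cat sat on mat"), 2))
    print(cer("你好世界", "你好视界"))
```

Because insertions count as errors while the denominator is the reference length, either metric can exceed 100 on a sufficiently poor hypothesis.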
