VIBEVOICE-ASR Technical Report
About
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice. It targets two challenges that persist in long-form audio (e.g., meetings, podcasts) despite recent advances in short-form speech recognition: context fragmentation and multi-speaker complexity. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speaker-attributed Automatic Speech Recognition | AISHELL-4 (test) | CER | 22.3 | 18 |
| Speaker-attributed Automatic Speech Recognition | Alimeeting Far (test) | CER | 34.67 | 14 |
| Multi-speaker Automatic Speech Recognition | AMI | CP-WER | 30.51 | 11 |
| Speaker-attributed Automatic Speech Recognition | AMI SDM | WER | 22.09 | 7 |
| Speaker-attributed Automatic Speech Recognition | AliMeeting | WER | 26.75 | 7 |
| Speaker-attributed Automatic Speech Recognition | AISHELL-4 | WER | 21.4 | 6 |
| Speaker-attributed Automatic Speech Recognition | Fisher (local setting) | DER | 17.68 | 4 |
| Speaker-attributed Automatic Speech Recognition | Candor (local setting) | DER | 30.89 | 4 |
| Speaker-attributed Automatic Speech Recognition | MLC Global Meeting-level | DER | 19.83 | 4 |
| Speaker-attributed Automatic Speech Recognition | MLC local setting | DER | 14.01 | 4 |
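
For readers unfamiliar with the error-rate metrics in the table, the following is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, expressed as a percentage. It is not the report's scoring pipeline. The function names are illustrative, no text normalization is applied, and the speaker-aware metrics in the table (CP-WER, DER) need additional steps such as speaker-permutation search or time-based scoring that are not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length, as a percentage."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, with spaces ignored
    (an assumption of this sketch; scoring tools differ on this point)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


if __name__ == "__main__":
    # One substitution and one deletion against six reference words.
    print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
    # One substituted character against six reference characters.
    print(round(cer("今天天气很好", "今天天汽很好"), 2))  # 16.67
```

CER is the usual choice for Mandarin datasets such as AISHELL-4 and AliMeeting because written Chinese has no whitespace word boundaries, which is why the table mixes character-level and word-level metrics.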