
VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

About

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet
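To make the NTP-versus-MTP distinction concrete, here is a minimal, self-contained sketch of the decoding-loop difference. Everything in it is illustrative: `toy_backbone` is a stand-in for a transformer forward pass, and the way the extra heads are simulated is an assumption, not VocalNet's actual architecture. The point it demonstrates is purely structural: with k prediction heads, each forward pass emits k tokens, so generating the same number of tokens takes roughly 1/k as many model calls.

```python
def toy_backbone(seq):
    """Stand-in for a transformer forward pass over a token sequence.
    Returns a single hidden "state" (here just a deterministic integer)."""
    return (sum(seq) * 31 + len(seq)) % 97


def ntp_decode(seed, steps):
    """Next-token prediction: one forward pass yields one new token."""
    seq, calls = list(seed), 0
    for _ in range(steps):
        h = toy_backbone(seq)
        calls += 1
        seq.append(h)  # single head: one token per call
    return seq, calls


def mtp_decode(seed, steps, k=3):
    """Multi-token prediction: one forward pass feeds k lightweight heads,
    each predicting one of the next k tokens (heads simulated here as
    simple offsets of the shared hidden state)."""
    seq, calls = list(seed), 0
    while len(seq) - len(seed) < steps:
        h = toy_backbone(seq)
        calls += 1
        for i in range(k):  # k heads -> up to k tokens per call
            if len(seq) - len(seed) >= steps:
                break
            seq.append((h + i) % 97)
    return seq, calls


if __name__ == "__main__":
    seed = [5, 12, 7]
    _, ntp_calls = ntp_decode(seed, steps=12)
    _, mtp_calls = mtp_decode(seed, steps=12, k=3)
    print(ntp_calls, mtp_calls)  # MTP needs ~1/k the forward passes
```

Running the sketch with 12 generation steps shows 12 forward passes for NTP against 4 for MTP with k=3, which is the latency lever the paper exploits for real-time voice interaction (the quality gains from MTP come from training effects the sketch does not model).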

Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speech Quality Evaluation | OpenAudioBench English subsets (test) | WER | 3.64 | 15 |
| Efficiency Evaluation | OpenAudioBench English subsets (test) | TPS | 374.8 | 15 |
| Text Quality Evaluation | OpenAudioBench English subsets (test) | AlpacaEval | 7.12 | 15 |
| Streaming Speech Generation | Streaming generation scenarios (0.6 s speech chunk) | First-chunk Latency (ms) | 462.3 | 13 |
