
VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

About

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet.
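The abstract's core idea is that MTP emits several future speech tokens per decoder step, whereas NTP emits exactly one. The toy sketch below illustrates that contrast with random NumPy weights; the `backbone`, head shapes, and `K` are illustrative assumptions, not VocalNet's actual architecture or its MTP implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, K = 100, 16, 4  # K = future tokens predicted per step (illustrative)

def backbone(token_ids):
    # Hypothetical stand-in for the speech LLM backbone:
    # returns one hidden state per input position.
    return rng.standard_normal((len(token_ids), HIDDEN))

# NTP: a single output head maps the last hidden state to one next-token distribution.
ntp_head = rng.standard_normal((HIDDEN, VOCAB))

# MTP: K heads, where head k predicts the token at offset t+k+1 from the same state.
mtp_heads = rng.standard_normal((K, HIDDEN, VOCAB))

def ntp_step(hidden_last):
    """One forward pass -> 1 speech token."""
    logits = hidden_last @ ntp_head            # (VOCAB,)
    return [int(np.argmax(logits))]

def mtp_step(hidden_last):
    """One forward pass -> K speech tokens, cutting decoder passes ~K-fold."""
    logits = hidden_last @ mtp_heads           # broadcasts to (K, VOCAB)
    return [int(np.argmax(row)) for row in logits]

hidden = backbone([1, 2, 3])[-1]
print(len(ntp_step(hidden)))   # 1 token per decoder pass
print(len(mtp_step(hidden)))   # K tokens per decoder pass
```

For a fixed-length token sequence, the MTP decoder thus needs roughly 1/K as many forward passes as the NTP decoder, which is the latency lever the paper exploits for real-time voice interaction.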

Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang · 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Speech-to-Text Question-Answering | WebQ | Accuracy | 62.9 | 23
Speech-to-Text Question-Answering | LlamaQ | Accuracy | 83.33 | 23
Speech-to-Text Question-Answering | TriviaQA | Accuracy | 62.16 | 23
Speech-to-Speech Question-Answering | Llama Questions | Accuracy | 78.6 | 15
Speech Quality Evaluation | OpenAudioBench English subsets (test) | WER | 3.64 | 15
Efficiency Evaluation | OpenAudioBench English subsets (test) | TPS | 374.8 | 15
Text Quality Evaluation | OpenAudioBench English subsets (test) | AlpacaEval | 7.12 | 15
Speech-to-Speech Question-Answering | TriviaQA | Accuracy | 57.3 | 13
Speech-to-Speech Question-Answering | WebQ | Accuracy | 59.8 | 13
Streaming Speech Generation | Streaming generation scenarios, 0.6s speech chunk | First-chunk Latency (ms) | 462.3 | 13

(Showing 10 of 16 rows.)
