VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

About

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet

Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang • 2025

Related benchmarks

Task	Dataset	Result
Speech-to-Speech Question-Answering	Llama Questions	Accuracy 78.6	27
Speech-to-Text Question-Answering	WebQ	Accuracy 62.9	26
Speech-to-Text Question-Answering	LlamaQ	Accuracy 83.33	26
Speech-to-Text Question-Answering	TriviaQA	Accuracy 62.16	26
Speech-to-Speech Question-Answering	WebQ	Accuracy 59.8	25
Speech-to-Speech Question-Answering	TriviaQA	Accuracy 57.3	22
Speech Quality Evaluation	OpenAudioBench English subsets (test)	WER 3.64	15
Efficiency Evaluation	OpenAudioBench English subsets (test)	TPS 374.8	15
Text Quality Evaluation	OpenAudioBench English subsets (test)	AlpacaEval 7.12	15
Spoken Question Answering	TriviaQA	--	15

Showing 10 of 16 rows
