LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

About

Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.

Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng• 2025

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval	--	1043
Mathematical Reasoning	GSM8K	Accuracy (GSM8K)21.8	358
General Knowledge	MMLU	MMLU General Knowledge Accuracy44.7	307
Automatic Speech Recognition	LibriSpeech Other	WER4	123
Question Answering	TriviaQA	Accuracy35.2	117
Automatic Speech Recognition	LibriSpeech Clean	WER3.5	107
Automatic Speech Recognition	VoxPopuli	WER9.5	38
Speech-to-Speech Question-Answering	Llama Questions	Accuracy73.67	27
Speech-to-Text Question-Answering	LlamaQ	Accuracy79.33	26
Speech-to-Text Question-Answering	WebQ	Accuracy51.4	26

Showing 10 of 48 rows

Other info

Follow for update

@wizwand_team Discord