
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

About

Real-time, intelligent, and natural speech interaction is an essential part of next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series of models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs such as GLM-4-Voice, which was trained on millions of hours of speech data.
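The abstract describes coupling an LLM with an autoregressive streaming speech decoder so that audio can be produced while the text response is still being generated. The sketch below is a minimal, hypothetical illustration of that interleaving idea (not the authors' code): it assumes a simple schedule in which a small chunk of speech tokens is emitted after every text token, and the `speech_tokens_per_text` parameter and fabricated chunk ids are purely illustrative.

```python
# Hedged sketch: interleaved text/speech streaming, in the spirit of an
# autoregressive streaming speech decoder. Real models predict discrete
# acoustic units from learned hidden states; here we fabricate chunk ids.

from typing import Iterator, List, Tuple


def interleaved_stream(text_tokens: List[str],
                       speech_tokens_per_text: int = 2
                       ) -> Iterator[Tuple[str, List[int]]]:
    """Yield (text_token, speech_chunk) pairs as decoding proceeds.

    Each time the (simulated) LLM produces a text token, the streaming
    speech decoder generates a small chunk of speech tokens conditioned
    on the text so far, so audio playback can begin before the full
    response is finished.
    """
    for i, tok in enumerate(text_tokens):
        # Placeholder speech decoder step: emit a fixed-size chunk of
        # (fake) speech-token ids per text token.
        chunk = [i * speech_tokens_per_text + j
                 for j in range(speech_tokens_per_text)]
        yield tok, chunk


if __name__ == "__main__":
    for text_tok, speech_chunk in interleaved_stream(["Hello", ",", "world"]):
        print(text_tok, speech_chunk)
```

The key property this illustrates is that the speech stream is produced incrementally alongside the text stream, which is what enables low-latency playback in a real-time spoken chatbot.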

Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Code Generation | HumanEval | – | 850 |
| Mathematical Reasoning | GSM8K | Accuracy 21.8 | 358 |
| General Knowledge | MMLU | Accuracy 44.7 | 170 |
| Question Answering | TriviaQA | Accuracy 35.2 | 85 |
| Automatic Speech Recognition | LibriSpeech Other | WER 4 | 75 |
| Automatic Speech Recognition | LibriSpeech Clean | WER 3.5 | 57 |
| Automatic Speech Recognition | VoxPopuli | WER 9.5 | 27 |
| Automatic Speech Recognition | LS Clean | WER 3.5 | 25 |
| Automatic Speech Recognition | VoxPopuli 1.0 (test) | Avg WER 9.5 | 14 |
| Text-to-Speech | LibriSpeech Clean | WER 10.1 | 12 |

Showing 10 of 36 rows.
