
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

About

We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time spoken dialogue and exhibits strong question-answering capabilities, demonstrating its versatility and efficiency. Our code, model and training data are available at https://github.com/baichuan-inc/Baichuan-Audio
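To make the 12.5 Hz frame rate concrete, the sketch below works out how long one frame lasts and how many audio tokens a clip would produce. Only the frame rate comes from the abstract. The codebook count, the rounding of partial frames up to a whole frame, and all function names are illustrative assumptions, not details of the Baichuan-Audio implementation.

```python
import math

FRAME_RATE_HZ = 12.5  # stated in the abstract


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of audio covered by one token frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def num_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames for a clip; a partial last frame is rounded up (assumption)."""
    return math.ceil(duration_s * frame_rate_hz)


def num_tokens(duration_s: float, num_codebooks: int,
               frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Total discrete tokens if every frame emits one token per codebook."""
    return num_frames(duration_s, frame_rate_hz) * num_codebooks


if __name__ == "__main__":
    HYPOTHETICAL_CODEBOOKS = 8  # assumption: the abstract does not give this number
    print(frame_duration_ms())                     # milliseconds per frame
    print(num_frames(10.0))                        # frames in a 10 s clip
    print(num_tokens(10.0, HYPOTHETICAL_CODEBOOKS))
```

At this rate each frame spans 80 ms, so a 10-second clip yields 125 frames; the total token count then scales linearly with however many codebooks the tokenizer actually uses.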

Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen • 2025

Related benchmarks

Task | Dataset | Result | Rank
Automatic Speech Recognition | LibriSpeech clean (test) | WER 3.02 | 833
Audio Understanding | MMAU (test) | Speech Score 42.47 | 25
General Audio Understanding | MMSU 1.0 (test) | Perception Semantics 39.63 | 16
Audio LLM–EEG Similarity Alignment | Naturalistic Speech Dataset OpenNeuro 2023 (sentence-averaged) | Pearson Correlation (RSA) 0.1669 | 12
Audio LLM–EEG Similarity Alignment | Alice in Wonderland sentence-averaged | Pearson RSA 0.2286 | 12
Speech Reconstruction | Seed-ZH | PESQ 1.84 | 12
Speech Reconstruction | Seed EN | PESQ 1.62 | 12
Automatic Speech Recognition | AISHELL-2 | ZH-CER 3.87 | 9
Massive Multi-discipline Audio Understanding | MMAU | Speech Score 14.4 | 9
Audio Tokenization | Seed-TTS-Eval ZH | PESQ NB 2.37 | 7
Showing 10 of 14 rows
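
For readers unfamiliar with the WER and CER metrics in the speech recognition rows above, here is a minimal sketch of how such error rates are commonly computed: Levenshtein edit distance between reference and hypothesis, divided by the reference length. It is not the evaluation code behind these numbers. Whitespace tokenization, the absence of text normalization, and the function names are assumptions. Benchmark tables usually report the value as a percentage, which is presumably the case here.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction; words are split on whitespace (assumption)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction; spaces are ignored (assumption)."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)


if __name__ == "__main__":
    print(100 * wer("a b c d", "a x c d"))    # one substitution out of four words
    print(100 * cer("你好世界", "你好世间"))    # one substitution out of four characters
```

Lower is better for both metrics, and because insertions count as errors the value can exceed 100%.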
