UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

About

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.
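To make the idea of a content-aware gate concrete, here is a minimal sketch of gated fusion between deep (semantic) and shallow (acoustic) features. It is an illustration under stated assumptions, not the authors' SAE implementation: the function name `gated_fusion`, the sigmoid gate, the residual form `semantic + gate * acoustic`, and all shapes and parameters are hypothetical and are not taken from the paper or its repository.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(semantic, acoustic, w_gate, b_gate):
    """Blend deep and shallow features frame by frame (hypothetical sketch).

    semantic, acoustic: lists of frames, each frame a list of `dim` floats.
    w_gate: `dim` weight vectors, each of length 2 * dim (assumed gate weights).
    b_gate: list of `dim` biases (assumed gate biases).
    Returns (fused, gates), both shaped like `semantic`.
    """
    fused, gates = [], []
    for s_vec, a_vec in zip(semantic, acoustic):
        # The gate sees both streams, so it can depend on the content.
        gate_in = s_vec + a_vec  # list concatenation, length 2 * dim
        gate = [
            sigmoid(sum(x * w for x, w in zip(gate_in, weights)) + bias)
            for weights, bias in zip(w_gate, b_gate)
        ]
        # Residual form: keep the semantic features, add gated acoustic detail.
        fused.append([s + g * a for s, g, a in zip(s_vec, gate, a_vec)])
        gates.append(gate)
    return fused, gates


# Toy inputs: 4 frames, 8 channels, random values for illustration only.
random.seed(0)
frames, dim = 4, 8
semantic = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(frames)]
acoustic = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(frames)]
w_gate = [[random.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
b_gate = [0.0] * dim

fused, gates = gated_fusion(semantic, acoustic, w_gate, b_gate)
```

Each gate value lies strictly between 0 and 1, so a frame can admit more or less shallow-layer detail per channel while the semantic features pass through unchanged.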

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speech Reconstruction | Librispeech (test-clean) | UT MOS | 4.19 | 64 |
| Audio Understanding | MMSU | Perception Score | 35.54 | 37 |
| Audio Reconstruction | Seed EN | Word Error Rate (WER) | 2.55 | 20 |
| Audio Understanding | MMAU | Overall Score | 61.1 | 14 |
| Speech Reconstruction | Librispeech other (test) | WER | 6.79 | 9 |
| Audio Understanding | MMAR | Speech Score | 45.24 | 5 |
| Clustering Analysis | ESC-10 | Silhouette Score | 0.091 | 5 |
| Clustering Analysis | ESC-50 | Silhouette Score | 0.023 | 5 |
| Speech Reconstruction | Seed-TTS ZH | WER | 1.9 | 5 |
| Text-to-Speech Synthesis | SEED-TTS en \| zh \| avg. | SIM (en) | 0.792 | 2 |
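Several rows above report word error rate (WER). As a reference for how that metric is commonly defined, the sketch below computes word-level edit distance divided by the number of reference words. It is an assumption-laden illustration: the function name `word_error_rate` is hypothetical, tokenization is plain whitespace splitting, and the text normalization and ASR model behind the reported scores are not stated on this page.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The function returns a fraction; multiply by 100 to express it as a percentage.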