UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

About

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.
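
The content-aware gating idea behind SAE can be illustrated with a toy example. The sketch below is not the authors' implementation: the function name `gated_fusion`, the single linear gate, and the residual form `deep + gate * shallow` are all assumptions chosen only to show how a gate computed from deep (semantic) features could control how much shallow-layer (acoustic) detail is restored for one frame.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(deep, shallow, weights, bias=0.0):
    """Fuse one frame of deep and shallow features with a scalar gate.

    Hypothetical illustration only (not the paper's SAE module):
      deep    -- features from a deep layer (assumed to carry linguistic content)
      shallow -- features from a shallow layer (assumed to carry acoustic detail)
      weights -- gate parameters applied to the deep features (assumed learned)
      bias    -- gate bias (assumed learned)
    Returns (gate, fused) where fused[i] = deep[i] + gate * shallow[i].
    """
    if not (len(deep) == len(shallow) == len(weights)):
        raise ValueError("deep, shallow and weights must have the same length")
    # The gate depends only on the deep features, i.e. it is "content-aware".
    gate = sigmoid(sum(w * d for w, d in zip(weights, deep)) + bias)
    # Add back a gated share of the shallow features to the deep features.
    fused = [d + gate * s for d, s in zip(deep, shallow)]
    return gate, fused


if __name__ == "__main__":
    gate, fused = gated_fusion(
        deep=[1.0, -2.0], shallow=[0.4, 0.8], weights=[0.0, 0.0]
    )
    print(gate, fused)
```

With all-zero gate weights and zero bias the gate is exactly 0.5, so half of each shallow feature is added back; a strongly negative bias drives the gate toward 0 and leaves the deep features nearly unchanged.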

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Speech Reconstruction	Librispeech (test-clean)	UT MOS	4.19	64
Audio Understanding	MMSU	Perception Score	35.54	37
Audio Reconstruction	Seed EN	Word Error Rate (WER)	2.55	20
Audio Understanding	MMAU	Overall Score	61.1	14
Speech Reconstruction	Librispeech other (test)	WER	6.79	9
Audio Understanding	MMAR	Speech Score	45.24	5
Clustering Analysis	ESC-10	Silhouette Score	0.091	5
Clustering Analysis	ESC-50	Silhouette Score	0.023	5
Speech Reconstruction	Seed-TTS ZH	WER	1.9	5
Text-to-Speech Synthesis	SEED-TTS en | zh | avg.	SIM (en)	0.792	2
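
Several rows above report Word Error Rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the ASR model and text normalization used to produce the figures in the table are not stated on this page. The function returns a fraction, whereas the table values are presumably percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Tokenization is plain whitespace splitting (an assumption; real
    evaluations usually normalize case and punctuation first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.4f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.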

