UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception
About
Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.
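The abstract does not specify how the SAE gate is computed, so the following is only a minimal sketch of one common form of content-aware gating, not the authors' implementation. It assumes a sigmoid gate derived from deep (semantic) features that controls how much shallow-layer (acoustic) detail is added back through a residual path. The function `gated_fusion`, the parameters `w_gate` and `b_gate`, and all shapes are hypothetical names chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(deep, shallow, w_gate, b_gate):
    """Blend shallow-layer features into deep-layer features.

    Hypothetical illustration only; the paper's SAE module may differ.
    deep, shallow: arrays of shape (T, D), one row per frame.
    w_gate: (D, D) projection, b_gate: (D,) bias.
    """
    # The gate depends on the deep (content) features, so the amount of
    # acoustic detail restored varies per frame and per channel.
    gate = sigmoid(deep @ w_gate + b_gate)  # values in (0, 1)
    # Residual form: deep features are kept, shallow detail is added.
    fused = deep + gate * shallow
    return fused, gate


rng = np.random.default_rng(0)
T, D = 4, 8  # frames, feature dimension (arbitrary toy sizes)
deep = rng.standard_normal((T, D))
shallow = rng.standard_normal((T, D))
w_gate = rng.standard_normal((D, D)) * 0.1
b_gate = np.zeros(D)

fused, gate = gated_fusion(deep, shallow, w_gate, b_gate)
print(fused.shape, float(gate.min()), float(gate.max()))
```

In this sketch, a strongly negative gate bias drives the output back toward the deep features alone, which is one way a design of this kind could leave the speech-oriented semantic representation untouched when acoustic detail is not needed.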
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speech Reconstruction | Librispeech (test-clean) | UT MOS | 4.19 | 64 |
| Audio Understanding | MMSU | Perception Score | 35.54 | 37 |
| Audio Reconstruction | Seed EN | Word Error Rate (WER) | 2.55 | 20 |
| Audio Understanding | MMAU | Overall Score | 61.1 | 14 |
| Speech Reconstruction | Librispeech other (test) | WER | 6.79 | 9 |
| Audio Understanding | MMAR | Speech Score | 45.24 | 5 |
| Clustering Analysis | ESC-10 | Silhouette Score | 0.091 | 5 |
| Clustering Analysis | ESC-50 | Silhouette Score | 0.023 | 5 |
| Speech Reconstruction | Seed-TTS ZH | WER | 1.9 | 5 |
| Text-to-Speech Synthesis | SEED-TTS en / zh / avg. | SIM (en) | 0.792 | 2 |