UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

About

We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.

Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng • 2026

Related benchmarks

Task	Dataset	Result
Language Understanding	MMLU	Accuracy 44.1	844
Language Understanding	MMLU (test)	--	167
Audio Captioning	AudioCaps (test)	CIDEr 0.69	157
Automatic Speech Recognition	LibriSpeech Other	WER 6.3	123
Automatic Speech Recognition	LibriSpeech Clean	WER 2.71	107
Text-to-Speech	Seed-TTS EN	WER 3.63	32
Text-to-Speech	LibriSpeech Clean	WER 3.46	12
Text-to-Speech	Seed-TTS ZH	WER 2.3	12
Automatic Speech Recognition	LibriSpeech (LS) clean	WER 2.7	11
Music Captioning	MusicCaps (test)	--	8

Showing 10 of 36 rows
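
Several rows above report WER (word error rate), apparently as a percentage. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation script, and real evaluations usually apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization (casing, punctuation), which real evaluations usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    rate = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER = {100 * rate:.2f}%")
```

Multiplying the returned fraction by 100 gives a figure on the same scale as the table entries, assuming those are percentages.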
