
UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

About

We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
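The two-stream factorization described above can be illustrated with a toy sketch. Everything here is an illustrative assumption, not the paper's implementation: the class name `ToyFactorizedTokenizer`, the strides, and the codebook sizes are invented, and simple scalar quantization stands in for the neural encoders so the interface is runnable. The point is only the shape of the design: a low-rate coarse "reasoning" stream alongside a full-rate fine "reconstruction" stream, with decoding driven by the fine stream.

```python
# Toy sketch of a factorized ("two-stream") audio tokenizer in the spirit of
# ReasoningCodec. All names and design details are illustrative assumptions;
# scalar quantization replaces the learned codec for the sake of a runnable demo.

from dataclasses import dataclass
from typing import List


@dataclass
class FactorizedTokens:
    reasoning: List[int]       # low-rate, coarse stream (analysis / planning)
    reconstruction: List[int]  # full-rate, fine stream (acoustic detail)


class ToyFactorizedTokenizer:
    def __init__(self, reasoning_stride: int = 4,
                 reasoning_levels: int = 8, recon_levels: int = 256):
        self.stride = reasoning_stride   # temporal downsampling for the coarse stream
        self.r_levels = reasoning_levels # small "codebook" for reasoning tokens
        self.f_levels = recon_levels     # large "codebook" for reconstruction tokens

    def _quantize(self, x: float, levels: int) -> int:
        # Map x in [-1, 1] to a discrete code in [0, levels - 1].
        x = max(-1.0, min(1.0, x))
        return min(levels - 1, int((x + 1.0) / 2.0 * levels))

    def encode(self, frames: List[float]) -> FactorizedTokens:
        # Reasoning tokens: coarse codes over strided frame averages.
        reasoning = [
            self._quantize(sum(frames[i:i + self.stride]) / self.stride, self.r_levels)
            for i in range(0, len(frames) - self.stride + 1, self.stride)
        ]
        # Reconstruction tokens: fine codes at the full frame rate.
        reconstruction = [self._quantize(f, self.f_levels) for f in frames]
        return FactorizedTokens(reasoning, reconstruction)

    def decode(self, tokens: FactorizedTokens) -> List[float]:
        # Invert only the fine stream to approximate the input frames.
        return [(c + 0.5) / self.f_levels * 2.0 - 1.0 for c in tokens.reconstruction]


frames = [0.1, -0.2, 0.3, 0.05, -0.5, 0.4, 0.0, 0.2]
tok = ToyFactorizedTokenizer()
out = tok.encode(frames)
print(len(out.reasoning), len(out.reconstruction))  # 2 8
```

Note the asymmetry this toy preserves: the reasoning stream is short and coarse (suited to language-model-style planning), while the reconstruction stream keeps enough resolution for the decoder to approximate the waveform.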

Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng · 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Understanding | MMLU | Accuracy | 44.1 | 756 |
| Audio Captioning | AudioCaps (test) | CIDEr | 0.69 | 140 |
| Language Understanding | MMLU (test) | - | - | 136 |
| Automatic Speech Recognition | LibriSpeech Other | WER | 6.3 | 75 |
| Automatic Speech Recognition | LibriSpeech Clean | WER | 2.71 | 57 |
| Text-to-Speech | LibriSpeech Clean | WER | 3.46 | 12 |
| Automatic Speech Recognition | LibriSpeech (LS) clean | WER | 2.7 | 11 |
| Speech Reconstruction | VCTK subset | PESQ (WB) | 2.36 | 7 |
| Audio Reconstruction | Speech | MUSHRA | 90.5 | 6 |
| Sound Reconstruction | AudioCaps (val) | ViSQOL | 3.1 | 6 |

Showing 10 of 36 rows.
