UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
About
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU | Accuracy | 44.1 | 756 |
| Audio Captioning | AudioCaps (test) | CIDEr | 0.69 | 140 |
| Language Understanding | MMLU (test) | -- | -- | 136 |
| Automatic Speech Recognition | LibriSpeech Other | WER | 6.3 | 75 |
| Automatic Speech Recognition | LibriSpeech Clean | WER | 2.71 | 57 |
| Text-to-Speech | LibriSpeech Clean | WER | 3.46 | 12 |
| Automatic Speech Recognition | LibriSpeech (LS) clean | WER | 2.7 | 11 |
| Speech Reconstruction | VCTK subset | PESQ (WB) | 2.36 | 7 |
| Audio Reconstruction | Speech | MUSHRA | 90.5 | 6 |
| Sound Reconstruction | AudioCaps (val) | ViSQOL score | 3.1 | 6 |
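
Several rows above report word error rate (WER). As a reading aid, the following is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words and expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. This is not the evaluation code behind the numbers in the table, and real evaluations usually apply text normalization (casing, punctuation) that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance.

    Hypothetical helper for illustration; splits on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Under this definition a WER of 2.71 corresponds to roughly 2.7 word errors per 100 reference words. Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.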