AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook

About

We propose AUV, a unified neural audio codec with a single codebook that reconstructs speech well and extends to general audio, including vocal, music, and sound. AUV can handle any 16 kHz mixed-domain audio segment at bit rates around 700 bps. To accomplish this, we guide the matryoshka codebook with nested domain-specific partitions, each assigned a corresponding teacher model for distillation, all in single-stage training. A conformer-style encoder-decoder architecture that uses STFT features as the audio representation yields better audio quality. Comprehensive evaluations demonstrate that AUV achieves audio reconstruction comparable to state-of-the-art domain-specific single-layer quantizer codecs, showcasing the potential of audio universal vector quantization with a single codebook. The pre-trained model and demo samples are available at https://swivid.github.io/AUV/.
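As a rough illustration of the nested-partition idea described above, the sketch below restricts each domain to a prefix of one shared codebook and does a nearest-neighbour lookup inside that prefix. It is a minimal toy under stated assumptions, not the authors' implementation: the partition sizes, the nesting order (speech innermost), the toy codebook, and the function names `quantize` and `bitrate_bps` are all hypothetical. The frame rate and codebook size used in the bitrate example are likewise illustrative values chosen only because they give 700 bps; the abstract does not state them.

```python
import math

# Hypothetical nested partitions: each domain may use only the first N entries
# of the single shared codebook, so a smaller partition is a prefix of a larger one.
# The sizes and the ordering are assumptions for illustration.
NESTED_SIZES = {"speech": 4, "vocal": 6, "music": 8, "sound": 10}

# Toy 2-D codebook with 10 entries (assumed; a real codebook is learned).
CODEBOOK = [[float(i), float(-i)] for i in range(10)]


def quantize(frame, codebook, domain):
    """Return the index of the nearest codeword inside the domain's prefix."""
    limit = NESTED_SIZES[domain]
    best_index, best_dist = 0, math.inf
    for index, codeword in enumerate(codebook[:limit]):
        # Squared Euclidean distance between the frame and this codeword.
        dist = sum((a - b) ** 2 for a, b in zip(frame, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bitrate_bps(frame_rate_hz, codebook_size):
    """Bit rate of a single-codebook codec: tokens per second times bits per token."""
    return frame_rate_hz * math.log2(codebook_size)


if __name__ == "__main__":
    frame = [9.0, -9.0]
    print(quantize(frame, CODEBOOK, "speech"))  # limited to entries 0..3
    print(quantize(frame, CODEBOOK, "sound"))   # may use all 10 entries
    print(bitrate_bps(50, 16384))               # assumed 50 Hz, 2**14 entries
```

With these assumed values, the same frame maps to a different index depending on which partition is allowed, and 50 tokens per second at 14 bits per token gives 700 bps, in line with the "around 700 bps" figure in the abstract.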

Yushen Chen, Kai Hu, Long Zhou, Shulin Feng, Xusheng Yang, Hangting Chen, Xie Chen • 2025

Related benchmarks

Task	Dataset	Result
Acoustic Consistency	SALMon	Speaker Consistency 47.5	66
Audio Reconstruction	AudioSet (eval)	Mel Distance 1.26	63
Audio Reconstruction	LibriSpeech clean (test)	STOI 0.91	25
Audio Reconstruction	Codec-SUPERB tiny (Speech)	Mel 1.847	14
Audio Reconstruction	Codec-SUPERB tiny (Overall)	Mel 1.246	7
Audio Reconstruction	Codec-SUPERB Music tiny	Mel 1.129	7
Token-level Predictability	LibriSpeech 100	Eval Loss 11.98	5

