AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook
About
We propose AUV, a unified neural audio codec with a single codebook, which reconstructs speech favourably and extends to general audio, including vocal, music, and sound. AUV can handle any 16 kHz mixed-domain audio segment at bit rates around 700 bps. To accomplish this, we guide a matryoshka codebook with nested domain-specific partitions, each assigned a corresponding teacher model for distillation, all in a single training stage. A conformer-style encoder-decoder architecture with STFT features as the audio representation is employed, yielding better audio quality. Comprehensive evaluations demonstrate that AUV achieves audio reconstruction comparable to state-of-the-art domain-specific single-layer quantizer codecs, showcasing the potential of audio universal vector quantization with a single codebook. The pre-trained model and demo samples are available at https://swivid.github.io/AUV/.
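As a rough illustration of what "nested domain-specific partitions" in a single codebook could look like, the sketch below restricts nearest-neighbour lookup to a prefix of one shared codebook, so that each smaller partition is contained in the larger ones. The domain order, partition sizes, toy codebook, and the function names `quantize` and `bitrate_bps` are all hypothetical and chosen for illustration; the abstract does not specify them, and teacher distillation is not shown.

```python
import math

# Hypothetical nested partition sizes: each domain may use the first N entries
# of one shared codebook, so smaller partitions are prefixes of larger ones.
# The domain order and sizes are illustrative assumptions, not AUV's values.
PARTITIONS = {"speech": 4, "vocal": 6, "music": 7, "sound": 8}


def quantize(vector, codebook, domain):
    """Return the index of the nearest codebook entry within the domain's prefix."""
    limit = PARTITIONS[domain]
    best_index, best_dist = 0, math.inf
    for index, entry in enumerate(codebook[:limit]):
        # Squared Euclidean distance between the input and this entry.
        dist = sum((v - e) ** 2 for v, e in zip(vector, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bitrate_bps(frames_per_second, codebook_size):
    """Bit rate of a single-codebook tokenizer that emits one index per frame."""
    return frames_per_second * math.log2(codebook_size)


# Toy 2-D codebook with 8 entries (assumed values, for demonstration only).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
            (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (5.0, 5.0)]

query = (4.9, 5.1)
speech_index = quantize(query, CODEBOOK, "speech")  # restricted to entries 0..3
sound_index = quantize(query, CODEBOOK, "sound")    # may use all 8 entries
```

With a single codebook, the bit rate is the token rate times the bits per index. For instance, a hypothetical 50 frames per second with a 16384-entry codebook would give `bitrate_bps(50, 16384)` = 700 bps; these two figures are picked only to illustrate the arithmetic and are not taken from the source.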
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Acoustic Consistency | SALMon | Speaker Consistency | 47.5 | 66 |
| Audio Reconstruction | AudioSet (eval) | Mel Distance | 1.26 | 63 |
| Audio Reconstruction | LibriSpeech clean (test) | STOI | 0.91 | 17 |
| Audio Reconstruction | Codec-SUPERB tiny (Speech) | Mel | 1.847 | 14 |
| Audio Reconstruction | Codec-SUPERB tiny (Overall) | Mel | 1.246 | 7 |
| Audio Reconstruction | Codec-SUPERB Music tiny | Mel | 1.129 | 7 |
| Token-level Predictability | LibriSpeech 100 | Eval Loss | 11.98 | 5 |