EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding
About
Audio codecs power discrete music generative modeling, music streaming and immersive media by compressing PCM audio to bandwidth-friendly bit-rates. Recent work has gravitated towards processing in the spectral domain; however, spectrogram-domain models typically struggle with phase modeling, which is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. Compensating for this inadequate representation of the audio signal requires adversarial discriminators, at the expense of convergence speed and training stability. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion, we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance. Compared to standard baselines that train for hundreds of thousands of steps, our model reduces the training budget by an order of magnitude and is markedly more compute-efficient while preserving high perceptual quality.
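To make the idea of quantizing complex-valued latents concrete, here is a minimal sketch of generic residual vector quantization (RVQ) applied directly to complex vectors, so that each codeword carries magnitude and phase together. This is not the paper's quantizer, which the abstract does not specify; the function names `nearest`, `rvq_encode` and `rvq_decode`, the toy codebooks and the squared-Euclidean distance in C^d are all assumptions made for illustration.

```python
def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared distance in C^d)."""
    def dist(codeword):
        # abs() of a complex difference is its modulus, so real and
        # imaginary parts are compared jointly rather than per channel.
        return sum(abs(c - v) ** 2 for c, v in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Greedy RVQ: each stage quantizes the residual left by the previous one."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    dim = len(codebooks[0][0])
    out = [0j] * dim
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


if __name__ == "__main__":
    # Hypothetical two-stage codebooks over 2-dimensional complex latents.
    codebooks = [
        [[1 + 1j, 0j], [0j, 1j]],    # coarse stage
        [[0.1j, 0j], [0j, 0j]],      # fine stage
    ]
    latent = [1 + 1.1j, 0j]
    codes = rvq_encode(latent, codebooks)
    print(codes, rvq_decode(codes, codebooks))
```

In a trained codec the codebooks would be learned and the latents would come from an encoder network; here both are hand-picked toy values.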
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Coding | LibriTTS Out of Domain, 24 kHz, 6 kbps (test) | SI-SDR | 7.58 | 4 |
| Audio Coding | LibriTTS In Domain, 24 kHz, 6 kbps (test) | SI-SDR | 10.5 | 4 |
| Audio Coding | LibriTTS Out of Domain, 24 kHz, 12 kbps (test) | SI-SDR | 11.2 | 3 |
| Audio Coding | LibriTTS In Domain, 24 kHz, 12 kbps (test) | SI-SDR | 13.67 | 3 |
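For readers unfamiliar with the metric, the sketch below computes scale-invariant signal-to-distortion ratio (SI-SDR) in dB using the commonly used definition: project the estimate onto the reference, then compare the energy of that projection with the energy of what remains. The function name `si_sdr`, the zero-mean normalization and the `eps` stabilizer are assumptions for illustration; the exact evaluation protocol behind the reported numbers is not stated here.

```python
import math


def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length real sequences."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    n = len(reference)
    # Zero-mean both signals (a common convention, assumed here).
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


if __name__ == "__main__":
    # Synthetic example: a sinusoid plus a small additive distortion.
    clean = [math.sin(0.1 * i) for i in range(200)]
    decoded = [c + 0.1 * math.cos(0.37 * i) for i, c in enumerate(clean)]
    print(f"SI-SDR: {si_sdr(clean, decoded):.2f} dB")
```

Because the reference is rescaled before the comparison, multiplying the decoded signal by a constant gain leaves the score essentially unchanged, which is why the metric is preferred over plain SNR when a codec does not preserve absolute level.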