DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
About
Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised learning (SSL) representations into the first-layer codec tokens. This work instead proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec tokens and enables the codec system to maintain high audio quality while operating at a low frame rate. A lower frame rate shortens the token sequence the LM must generate, which improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems such as Mimi Codec, SpeechTokenizer, DAC, and Encodec. Demos are available at https://dualcodec.github.io and code is available at https://github.com/jiaqili3/DualCodec.
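To make the frame-rate trade-off concrete, the following minimal sketch computes how frame rate affects the sequence length an LM has to model and the nominal bitrate of a multi-codebook codec. The frame rates, codebook count, and codebook size used here are illustrative assumptions, not values reported in the abstract, and the function names are hypothetical.

```python
import math


def first_layer_sequence_length(duration_s: float, frame_rate_hz: float) -> int:
    """Number of first-layer tokens for an utterance of the given duration."""
    return math.ceil(duration_s * frame_rate_hz)


def nominal_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Nominal bitrate: frames per second x codebooks x bits per code index."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


if __name__ == "__main__":
    # Assumed, illustrative settings (not taken from the source text).
    duration_s = 10.0
    num_codebooks = 8
    codebook_size = 1024

    for frame_rate_hz in (12.5, 75.0):
        length = first_layer_sequence_length(duration_s, frame_rate_hz)
        bitrate = nominal_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size)
        print(f"{frame_rate_hz} Hz: {length} first-layer tokens, {bitrate:.0f} bps")
```

Under these assumed settings, a 10-second utterance needs 125 first-layer tokens at 12.5 Hz versus 750 at 75 Hz, which is why a lower frame rate reduces the cost of autoregressive generation.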
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 9.8 | 833 |
| Text-to-Speech | Seed-TTS (eval) | WER | 5.5 | 39 |
| Voice Conversion | VCTK | WER | 21.5 | 21 |
| Speech Reconstruction | SeedTTS en (test) | WER | 0.0263 | 18 |
| Speech Reconstruction | Salmon Sentiment Consistency emotional 2025b (OOD) | WER | 3.6 | 18 |
| Speech Recognition | Switchboard | WER | 28.2 | 18 |
| Speech Reconstruction | LibriSpeech clean (test) | WER | 2.1 | 15 |
| Text-to-Speech | LibriTTS clean (test) | WER | 0.1 | 15 |
| Audio Encoding and Decoding Efficiency | NVIDIA A6000 Efficiency Benchmark | RTF (Encoding) | 0.0078 | 12 |
| Speech Reconstruction | Japanese Versatile Speech unseen language speech 2019 (OOD) | WER | 5 | 9 |