DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

About

Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised learning (SSL) representations into the first-layer codec tokens. This work instead proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec tokens and enables the codec system to maintain high audio quality while operating at a low frame rate. A low-frame-rate codec, in turn, improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems such as Mimi Codec, SpeechTokenizer, DAC, and Encodec. Demos are available at https://dualcodec.github.io, and code is available at https://github.com/jiaqili3/DualCodec
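
The efficiency claim can be made concrete with a little arithmetic: an LM that predicts one first-layer token per codec frame needs fewer generation steps when the frame rate drops. The sketch below is only an illustration. The function names are hypothetical, and the frame rates in the loop are example values chosen for the comparison, not figures taken from the abstract.

```python
import math


def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of codec frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


def step_ratio(high_rate_hz: float, low_rate_hz: float) -> float:
    """How many times more first-layer tokens the higher frame rate produces."""
    return high_rate_hz / low_rate_hz


if __name__ == "__main__":
    duration_s = 10.0  # assumed utterance length, for illustration only
    # Illustrative frame rates (assumptions, not values stated in the abstract).
    for rate_hz in (75.0, 50.0, 25.0, 12.5):
        n = frames_for(duration_s, rate_hz)
        print(f"{rate_hz:5.1f} Hz -> {n} first-layer tokens for {duration_s:.0f} s")
```

Under these assumptions, the number of autoregressive steps scales linearly with the frame rate, which is why lowering it shortens generation provided audio quality can be maintained.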

Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, Zhizheng Wu • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Automatic Speech Recognition	LibriSpeech clean (test)	WER	9.8	1410
Speech Reconstruction	LibriSpeech clean (test)	UTMOS Score	4.12	60
Text-to-Speech	Seed-TTS (eval)	WER	5.5	39
Text-to-Speech	LibriTTS clean (test)	WER	0.1	37
Speech Recognition	Switchboard	WER	28.2	37
Voice Conversion	VCTK	WER	21.5	27
Speech Reconstruction	SeedTTS en (test)	WER	0.0263	21
Speech Reconstruction	Salmon Sentiment Consistency emotional 2025b (OOD)	WER	3.6	18
Neural Speech Coding	LibriSpeech clean (test)	PESQ	2.954	16
Audio Encoding and Decoding Efficiency	NVIDIA A6000 Efficiency Benchmark	RTF (Encoding)	0.0078	12

Showing 10 of 22 rows
