SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

About

Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models. However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high scores on UTMOS and WER, indicating superior naturalness and intelligibility. Moreover, SAC substantially surpasses prior codecs in semantic representation, approaching the level of continuous self-supervised embeddings. When used as a tokenizer for LLM-based text-to-speech, SAC enables a single-stage autoregressive (AR) TTS model that clearly outperforms state-of-the-art AR systems. Our disentanglement analysis further validates the effectiveness of the dual-stream design, offering new potential for controllable speech generation.
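As a rough illustration of what dual-stream quantization means at the token level, the sketch below maps two per-frame feature sequences to two separate token streams by nearest-neighbor codebook lookup, then adds up the resulting bitrate. It is a toy under stated assumptions, not SAC's actual architecture: the function names, the 2-dimensional features, the codebook sizes, and the 50 frames-per-second rate are all hypothetical and do not come from the paper.

```python
import math

def nearest_index(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def dual_stream_tokens(semantic_frames, acoustic_frames,
                       semantic_codebook, acoustic_codebook):
    """Quantize each stream independently with its own codebook (hypothetical helper)."""
    semantic_tokens = [nearest_index(f, semantic_codebook) for f in semantic_frames]
    acoustic_tokens = [nearest_index(f, acoustic_codebook) for f in acoustic_frames]
    return semantic_tokens, acoustic_tokens

def stream_bitrate(frames_per_second, codebook_size):
    """Bits per second for one token stream: frame rate times bits per token."""
    return frames_per_second * math.log2(codebook_size)

# Toy data: every value below is made up for illustration.
semantic_codebook = [[0.0, 0.0], [1.0, 1.0]]                          # 2 entries, 1 bit per token
acoustic_codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [2.0, 2.0]]  # 4 entries, 2 bits per token
semantic_frames = [[0.1, -0.1], [0.9, 1.2]]
acoustic_frames = [[0.9, 0.1], [0.4, 0.6]]

semantic_tokens, acoustic_tokens = dual_stream_tokens(
    semantic_frames, acoustic_frames, semantic_codebook, acoustic_codebook
)

# Assumed frame rate of 50 frames per second for both streams.
total_bitrate = (stream_bitrate(50, len(semantic_codebook))
                 + stream_bitrate(50, len(acoustic_codebook)))

print(semantic_tokens, acoustic_tokens, total_bitrate)
```

Keeping the two codebooks separate is what lets each stream be trained and sized for its own role; in a real codec the features would come from learned encoders rather than hand-written lists.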

Wenxi Chen, Xinsheng Wang, Ruiqi Yan, Yushen Chen, Zhikang Niu, Ziyang Ma, Xiquan Li, Yuzhe Liang, Hanlin Wen, Shunshun Yin, Ming Tao, Xie Chen • 2025

Related benchmarks

Task	Dataset	Result
Speech Reconstruction	Chinese speech	UTMOS 2.99	19
Speech Reconstruction	English speech	UTMOS 3.88	19
Neural Speech Coding	LibriSpeech clean (test)	STOI 90	11
Neural Speech Coding	LibriSpeech (test-other)	STOI 0.87	11
Voice Conversion	LibriTTS (test-clean)	WER 24.21	11

