SACodec: Asymmetric Quantization with Semantic Anchoring for Low-Bitrate High-Fidelity Neural Speech Codecs
About
Neural Speech Codecs face a fundamental trade-off at low bitrates: preserving acoustic fidelity often compromises semantic richness. To address this, we introduce SACodec, a novel codec built upon an asymmetric dual-quantizer that employs our proposed Semantic Anchoring mechanism. This design strategically decouples the quantization of Semantic and Acoustic details. The semantic anchoring is achieved via a lightweight projector that aligns acoustic features with a frozen, large-scale mHuBERT codebook, injecting linguistic priors while guaranteeing full codebook utilization. Sequentially, for acoustic details, a residual activation module with SimVQ enables a single-layer quantizer (acoustic path) to faithfully recover fine-grained information. At just 1.5 kbps, SACodec establishes a new state of the art by excelling in both fidelity and semantics: subjective listening tests confirm that its reconstruction quality is perceptually highly comparable to ground-truth audio, while its tokens demonstrate substantially improved semantic richness in downstream tasks.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Speech Reconstruction | LibriTTS clean (test) | PESQ2.6937 | 50 | |
| Speech Reconstruction | LibriTTS (test-other) | UTMOS3.4786 | 44 | |
| Audio Reconstruction | LJSpeech | UTMOS3.9912 | 26 | |
| Semantic Representation Evaluation | ARCH (test) | RAVDESS42.65 | 13 | |
| Semantic Representation Classification | ARCH Reconstruction Domain | RAVDESS Accuracy75.69 | 10 |