
Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec

About

Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in the reconstructed speech. In this work, we demonstrate that a self-supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self-supervised representations from the codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer-based codecs, allowing a zero-lookahead architecture for real-time deployment. As a result, our JHCodec achieves state-of-the-art performance while maintaining minimal latency and reduced training cost. We open-source the full implementation, training pipeline, and demo on GitHub: https://github.com/jhcodec843/jhcodec.
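To make the idea concrete, here is a minimal sketch of what a representation-reconstruction loss can look like. This is an illustration only: the abstract does not specify the teacher model or the distance metric, so a frame-wise L1 distance between self-supervised features of the reference audio and of the codec's reconstruction is assumed here.

```python
# Sketch of a self-supervised representation reconstruction (SSRR) loss.
# Assumption: an L1 distance between frame-level teacher features is used;
# the paper's actual feature extractor and metric may differ.

def ssrr_loss(teacher_feats, recon_feats):
    """Mean absolute difference between self-supervised features of the
    reference audio (teacher_feats) and of the codec reconstruction
    (recon_feats). Both are [frames][dims] lists of floats."""
    assert len(teacher_feats) == len(recon_feats), "frame counts must match"
    total, count = 0.0, 0
    for t_frame, r_frame in zip(teacher_feats, recon_feats):
        for t, r in zip(t_frame, r_frame):
            total += abs(t - r)
            count += 1
    return total / count

# Toy usage: two frames of 2-dim features, each entry off by 0.5.
loss = ssrr_loss([[1.0, 2.0], [0.0, 1.0]], [[1.5, 2.5], [0.5, 1.5]])
```

In training, this term would be added to the usual reconstruction and adversarial objectives, with the teacher features coming from a frozen self-supervised speech model.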

Junhyeok Lee, Xiluo He, Jihwan Lee, Helin Wang, Shrikanth Narayanan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Reconstruction | LibriSpeech (test-clean) | UTMOS | 3.3229 | 59 |
| Speech Coding | TITW Hard (test) | dWER | 12.28 | 10 |
| Neural Audio Compression | LibriSpeech (test-other) | WER | 6.3 | 10 |
| Automatic Speech Recognition | LibriSpeech | Word Error Rate (WER) | 4.11 | 9 |
| Speech Reconstruction | MLS (Multilingual LibriSpeech) Non-English (test) | WER | 7.44 | 9 |
| Audio Coding | Audio 16kHz 22kHz (test) | Bitrate (kbps) | 4 | 8 |
