Reconstruct! Don't Encode: Self-Supervised Representation Reconstruction Loss for High-Intelligibility and Low-Latency Streaming Neural Audio Codec
About
Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in the reconstructed speech. In this work, we demonstrate that a self-supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self-supervised representations from codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer-based codecs, allowing a zero-lookahead architecture for real-time deployment. As a result, our JHCodec achieves state-of-the-art performance while maintaining minimal latency and reduced training cost. We open-source the full implementation, training pipeline, and demo on GitHub: https://github.com/jhcodec843/jhcodec.
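To make the SSRR idea concrete, here is a minimal sketch of the loss: features extracted from the original audio by a frozen self-supervised model are compared against features re-extracted from the codec's reconstructed audio. This is a hypothetical illustration using a mean-squared-error distance and NumPy arrays standing in for SSL features (e.g. HuBERT/WavLM layer outputs); the paper's exact feature extractor and distance may differ.

```python
import numpy as np

def ssrr_loss(ssl_feats_target: np.ndarray, ssl_feats_recon: np.ndarray) -> float:
    """SSRR loss sketch: MSE between frozen SSL features of the original
    audio and SSL features re-extracted from the codec's reconstruction.
    (Illustrative only; the actual distance used in JHCodec may differ.)"""
    assert ssl_feats_target.shape == ssl_feats_recon.shape
    return float(np.mean((ssl_feats_target - ssl_feats_recon) ** 2))

# Toy example: (frames x dims) matrices standing in for SSL-model outputs.
rng = np.random.default_rng(0)
target = rng.standard_normal((50, 768))                  # features of the input audio
recon = target + 0.1 * rng.standard_normal((50, 768))    # features of a slightly degraded reconstruction
loss = ssrr_loss(target, recon)
print(loss)  # small positive value; 0.0 for a perfect reconstruction
```

In training, this term would be added to the usual reconstruction and adversarial losses, pushing the decoder to preserve the content captured by the self-supervised representation rather than only the mel-spectrogram.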
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Reconstruction | LibriSpeech (test-clean) | UTMOS 3.3229 | 59 |
| Speech Coding | TITW Hard (test) | dWER 12.28 | 10 |
| Neural Audio Compression | LibriSpeech (test-other) | WER 6.3 | 10 |
| Automatic Speech Recognition | LibriSpeech | WER 4.11 | 9 |
| Speech Reconstruction | MLS (Multilingual LibriSpeech) Non-English (test) | WER 7.44 | 9 |
| Audio Coding | Audio 16kHz/22kHz (test) | Bitrate 4 kbps | 8 |