NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference

About

Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.
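The link between frame rate and autoregressive cost can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes one autoregressive step per codec frame and a codec with several equally sized codebooks per frame. The function names, the comparison frame rates (25 and 75 FPS), and the codebook configuration are hypothetical and are not taken from the paper; only the 12.5 FPS figure comes from the abstract.

```python
import math


def autoregressive_steps(duration_s: float, frame_rate_fps: float) -> int:
    """Frames needed to cover `duration_s` seconds of audio.

    Assumes one autoregressive step per codec frame (an assumption of this
    sketch, not a statement about any specific model).
    """
    return math.ceil(duration_s * frame_rate_fps)


def bitrate_kbps(frame_rate_fps: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a codec that emits `num_codebooks` tokens per frame,
    each drawn from a codebook with `codebook_size` entries."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_fps * bits_per_frame / 1000.0


if __name__ == "__main__":
    duration_s = 10.0
    # 12.5 FPS is the rate named in the abstract; 25 and 75 FPS are
    # hypothetical comparison points.
    for fps in (12.5, 25.0, 75.0):
        steps = autoregressive_steps(duration_s, fps)
        print(f"{fps:5.1f} FPS -> {steps} steps for {duration_s:.0f} s of audio")

    # Hypothetical codebook configuration, not NanoCodec's actual setup.
    print(f"{bitrate_kbps(12.5, 8, 1024):.2f} kbps")
```

Under these assumptions, halving the frame rate halves the number of sequential generation steps, while the bitrate can be held constant by packing more bits into each frame.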

Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg • 2025

Related benchmarks

Task	Dataset	Result
Speech Reconstruction	LibriSpeech (test-clean)	UT MOS 3.163	64
Neural Audio Compression	LibriSpeech (test-other)	WER 6.11	10
Speech Coding	TITW Hard (test)	dWER 12.79	10
Speech Reconstruction	MLS (Multilingual LibriSpeech) Non-English (test)	WER 7.5	9
Automatic Speech Recognition	LibriSpeech	Word Error Rate (WER) 7.26	9
Audio Coding	Audio 16kHz 22kHz (test)	Bitrate (kbps) 1.78	8

