FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

About

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 to 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
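As a rough illustration of how a single binary codebook maps to the quoted bitrates, the sketch below computes bitrate as codes per second times bits per code. The abstract states neither the code rate nor the code width, so the 50 Hz rate and the 11-bit and 16-bit widths are assumptions chosen only because they reproduce the 0.55 and 0.80 kbps endpoints; the function names are hypothetical and not part of the FocalCodec code base.

```python
# Minimal sketch: bitrate of a codec that emits one binary code per frame.
# ASSUMPTIONS (not stated in the abstract): 50 codes per second, and code
# widths of 11 and 16 bits. They are picked to match 0.55 and 0.80 kbps.

ASSUMED_CODE_RATE_HZ = 50.0  # hypothetical code (frame) rate


def bitrate_kbps(code_rate_hz: float, bits_per_code: int) -> float:
    """Return the bitrate in kbps for one code of `bits_per_code` bits per frame."""
    return code_rate_hz * bits_per_code / 1000.0


def codebook_size(bits_per_code: int) -> int:
    """Number of distinct codes representable by a binary code of this width."""
    return 2 ** bits_per_code


if __name__ == "__main__":
    for bits in (11, 16):  # assumed widths, see note above
        rate = bitrate_kbps(ASSUMED_CODE_RATE_HZ, bits)
        print(f"{bits} bits/code, {codebook_size(bits)} codes: {rate:.2f} kbps")
```

Under these assumptions, widening the binary code is the only knob that changes the bitrate, since the code rate is fixed and there is a single codebook.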

Luca Della Libera, Cem Subakan, Mirco Ravanelli • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Reconstruction | LibriSpeech (test-clean) | UTMOS 2.9772 | 59 |
| Neural Audio Compression | LibriSpeech (test-other) | WER 9.34 | 10 |
| Speech Coding | TITW Hard (test) | dWER 20.23 | 10 |
| Speech Reconstruction | MLS (Multilingual LibriSpeech) Non-English (test) | WER 11.48 | 9 |
| Audio Coding | Audio 16kHz 22kHz (test) | Bitrate (kbps) 0.8 | 8 |