CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

About

Text-guided sound separation enables flexible audio editing, assistive listening, and open-domain source extraction, but systems such as AudioSep remain too expensive for low-latency edge or codec-mediated deployment. Existing neural audio codec separators are efficient, yet largely restricted to fixed stems or closed taxonomies. We introduce CodecSep, a prompt-driven universal sound separation framework that extracts sources directly in neural audio codec latent space. CodecSep combines a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP text embeddings, enabling open-vocabulary separation while preserving codec-native efficiency. Across dnr-v2 and five open-domain benchmarks, CodecSep consistently improves over AudioSep in SI-SDR, remains competitive in ViSQOL, and achieves clear gains in human MOS-LQS. Controlled analyses show that fine-grained prompts outperform coarse labels, and that explicit latent masking is substantially more effective than decoder-style latent generation in codec space. Qualitative diagnostics show that neural audio codec latents retain source-dependent structure, which CodecSep exploits mainly through channel-wise source-conditioned modulation. CodecSep also provides a practical code-stream deployment path. When audio is transmitted as neural audio codec codes, CodecSep maps codes to embeddings, separates directly in codec space, and outputs waveforms or re-quantized codes, avoiding the decode-separate-re-encode loop. In this regime, CodecSep requires only 1.35 GMACs end-to-end: about 54 times less compute than AudioSep in the same pipeline and 25 times lower separator-only compute, with much lower latency and memory. More broadly, CodecSep offers a blueprint for codec-native downstream audio processing.
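The abstract describes the separator as a FiLM-conditioned masker that modulates codec latents channel-wise from a text embedding. The idea can be illustrated with a minimal, dependency-free sketch. It is a toy under stated assumptions, not the authors' implementation: the Transformer layers, the DAC encoder/decoder, and the CLAP text encoder are all omitted, and every name below (`linear`, `film_mask_separate`, `demo`, the weight matrices, the tiny shapes) is hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Plain dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film_mask_separate(latent, text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Toy FiLM-conditioned masking on a latent of shape [channels][frames].

    Assumption: the text embedding is projected to one scale (gamma) and one
    shift (beta) per latent channel; the modulated latent is squashed into a
    (0, 1) mask that is multiplied back onto the original latent.
    """
    gamma = linear(w_gamma, b_gamma, text_emb)  # per-channel scale
    beta = linear(w_beta, b_beta, text_emb)     # per-channel shift
    separated = []
    for c, channel in enumerate(latent):
        row = []
        for z in channel:
            modulated = gamma[c] * z + beta[c]            # FiLM: scale and shift
            mask = 1.0 / (1.0 + math.exp(-modulated))     # sigmoid mask in (0, 1)
            row.append(mask * z)                          # explicit latent masking
        separated.append(row)
    return separated


def demo(seed=0, channels=4, frames=6, text_dim=3):
    """Build random stand-ins for the latent, the prompt embedding, and weights."""
    rng = random.Random(seed)
    latent = [[rng.uniform(-1, 1) for _ in range(frames)] for _ in range(channels)]
    text_emb = [rng.uniform(-1, 1) for _ in range(text_dim)]
    w_gamma = [[rng.uniform(-1, 1) for _ in range(text_dim)] for _ in range(channels)]
    w_beta = [[rng.uniform(-1, 1) for _ in range(text_dim)] for _ in range(channels)]
    b_gamma = [1.0] * channels  # start near identity scaling
    b_beta = [0.0] * channels
    out = film_mask_separate(latent, text_emb, w_gamma, b_gamma, w_beta, b_beta)
    return latent, out


if __name__ == "__main__":
    mixture, estimate = demo()
    print(len(estimate), "channels x", len(estimate[0]), "frames")
```

Because the mask lies strictly between 0 and 1, each output value can only attenuate the corresponding latent value, which is what distinguishes masking from the decoder-style latent generation that the abstract reports as less effective.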

Adhiraj Banerjee, Vipul Arora • 2025

Related benchmarks

Task	Dataset	Result
Audio Separation	NVIDIA A-30 GPU Compute Profiling	Memory Footprint (MB): 28.06	9
Sound Separation	Efficiency Benchmarking Suite (inference)	GMACs (Inference): 1.35	9
Sound Separation	NVIDIA Sound Separation Efficiency Benchmarking A-30	Full Memory Footprint (MB): 76.46	9
Sound Separation	NVIDIA A-30 GPU Compute Benchmark	Parameter-only Memory Footprint (MB): 48.4	9
Sound Separation	Sound Separation	Parameters (M): 16.3	9
Universal Sound Separation	dnr v2 (test)	Music SI-SDR: 1.2	8
Source Separation	dnr v2 (test)	Overall MOS-LQS Score: 3.34	2
Sound Separation	ESC-50	SI-SDR: 5.9	2
Sound Separation	Clotho V2	SI-SDR: 6	2
Sound Separation	AudioSet	SI-SDR: 6.4	2

Showing 10 of 12 rows
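The efficiency figures quoted in the abstract and listed in the table can be cross-checked with simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the implied AudioSep cost is derived from the abstract's "about 54 times" ratio rather than reported directly, and reading the 28.06 MB row as the non-parameter part of the full footprint is an assumption.

```python
def implied_baseline_gmacs(gmacs, ratio):
    """Baseline compute implied by a reported cost and a reported speed-up ratio."""
    return gmacs * ratio


def memory_rows_consistent(parts_mb, total_mb, tol=0.005):
    """Check whether component footprints add up to the reported total."""
    return abs(sum(parts_mb) - total_mb) <= tol


if __name__ == "__main__":
    # 1.35 GMACs end-to-end at "about 54 times less compute" implies
    # roughly 72.9 GMACs for AudioSep in the same pipeline (approximate).
    print(implied_baseline_gmacs(1.35, 54))

    # Parameter-only (48.4 MB) plus the 28.06 MB row matches the
    # full footprint of 76.46 MB listed in the table.
    print(memory_rows_consistent([48.4, 28.06], 76.46))
```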
