CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents
About
Text-guided sound separation enables flexible audio editing, assistive listening, and open-domain source extraction, but systems such as AudioSep remain too expensive for low-latency edge or codec-mediated deployment. Existing neural audio codec separators are efficient, yet largely restricted to fixed stems or closed taxonomies. We introduce CodecSep, a prompt-driven universal sound separation framework that extracts sources directly in neural audio codec latent space. CodecSep combines a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP text embeddings, enabling open-vocabulary separation while preserving codec-native efficiency. Across dnr-v2 and five open-domain benchmarks, CodecSep consistently improves over AudioSep in SI-SDR, remains competitive in ViSQOL, and achieves clear gains in human MOS-LQS. Controlled analyses show that fine-grained prompts outperform coarse labels, and that explicit latent masking is substantially more effective than decoder-style latent generation in codec space. Qualitative diagnostics show that neural audio codec latents retain source-dependent structure, which CodecSep exploits mainly through channel-wise source-conditioned modulation. CodecSep also provides a practical code-stream deployment path. When audio is transmitted as neural audio codec codes, CodecSep maps codes to embeddings, separates directly in codec space, and outputs waveforms or re-quantized codes, avoiding the decode-separate-re-encode loop. In this regime, CodecSep requires only 1.35 GMACs end-to-end: about 54 times less compute than AudioSep in the same pipeline and 25 times lower separator-only compute, with much lower latency and memory. More broadly, CodecSep offers a blueprint for codec-native downstream audio processing.
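To make the idea of "channel-wise source-conditioned modulation" followed by explicit latent masking concrete, here is a minimal sketch in plain Python. It is not the CodecSep implementation. The function name `film_mask` and the per-channel `gamma` and `beta` values are hypothetical. In the described system they would come from a CLAP text embedding, and the mask would be predicted by a Transformer rather than by the single sigmoid used here.

```python
import math


def film_mask(latents, gamma, beta):
    """Toy FiLM-conditioned masking of codec latents (illustrative only).

    latents: list of C channels, each a list of T frame values.
    gamma, beta: hypothetical per-channel scale and shift (length C),
                 standing in for values predicted from a text prompt.
    Returns masked latents with the same C x T shape.
    """
    if not (len(latents) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per latent channel")

    separated = []
    for channel, g, b in zip(latents, gamma, beta):
        row = []
        for z in channel:
            # FiLM: channel-wise affine modulation of the feature.
            modulated = g * z + b
            # Squash to (0, 1) so the result acts as a soft mask.
            mask = 1.0 / (1.0 + math.exp(-modulated))
            # Explicit masking: scale the original latent instead of
            # generating a new one.
            row.append(mask * z)
        separated.append(row)
    return separated


if __name__ == "__main__":
    mixture = [[2.0, -2.0], [1.0, 3.0]]
    # Channel 0 gets a neutral mask of 0.5; channel 1 is almost fully kept.
    print(film_mask(mixture, gamma=[0.0, 0.0], beta=[0.0, 50.0]))
```

The point of the sketch is the contrast drawn in the abstract: the output is the input latent scaled by a prompt-dependent mask, not a latent generated from scratch by a decoder.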
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Separation | NVIDIA A-30 GPU Compute Profiling | Memory Footprint (MB) | 28.06 | 9 |
| Sound Separation | Efficiency Benchmarking Suite (inference) | GMACs (Inference) | 1.35 | 9 |
| Sound Separation | NVIDIA Sound Separation Efficiency Benchmarking A-30 | Full Memory Footprint (MB) | 76.46 | 9 |
| Sound Separation | NVIDIA A-30 GPU Compute Benchmark | Parameter-only Memory Footprint (MB) | 48.4 | 9 |
| Sound Separation | Sound Separation | Parameters (M) | 16.3 | 9 |
| Universal Sound Separation | dnr v2 (test) | Music SI-SDR | 1.2 | 8 |
| Source Separation | dnr v2 (test) | Overall MOS-LQS Score | 3.34 | 2 |
| Sound Separation | ESC-50 | SI-SDR | 5.9 | 2 |
| Sound Separation | Clotho V2 | SI-SDR | 6 | 2 |
| Sound Separation | AudioSet | SI-SDR | 6.4 | 2 |
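
Several rows above report SI-SDR (scale-invariant signal-to-distortion ratio, in dB). As a reading aid, the sketch below computes the standard definition in plain Python: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name `si_sdr` is illustrative. The mean removal that many evaluation toolkits apply first is omitted, and this is not necessarily the exact evaluation code behind the listed results.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length signals."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0.0:
        raise ValueError("reference signal is silent")

    # Optimal scaling of the reference toward the estimate.
    dot = sum(e * r for e, r in zip(estimate, reference))
    scale = dot / ref_energy

    # Target component and residual distortion.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)

    if residual_energy == 0.0:
        return math.inf   # perfect estimate up to a gain factor
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / residual_energy)


if __name__ == "__main__":
    ref = [1.0, 0.0, -1.0, 0.0]
    est = [1.0, 1.0, -1.0, 1.0]
    print(si_sdr(est, ref))                      # 0 dB: equal target and residual energy
    print(si_sdr([3.0 * x for x in est], ref))   # unchanged by rescaling the estimate
```

Because the reference is rescaled before the comparison, multiplying the estimate by any nonzero gain leaves the score unchanged, which is why the metric suits separators whose output level is not calibrated.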