
CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

About

Text-guided sound separation enables flexible audio editing, assistive listening, and open-domain source extraction, but systems such as AudioSep remain too expensive for low-latency edge or codec-mediated deployment. Existing neural audio codec separators are efficient, yet largely restricted to fixed stems or closed taxonomies. We introduce CodecSep, a prompt-driven universal sound separation framework that extracts sources directly in neural audio codec latent space. CodecSep combines a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP text embeddings, enabling open-vocabulary separation while preserving codec-native efficiency. Across dnr-v2 and five open-domain benchmarks, CodecSep consistently improves over AudioSep in SI-SDR, remains competitive in ViSQOL, and achieves clear gains in human MOS-LQS. Controlled analyses show that fine-grained prompts outperform coarse labels, and that explicit latent masking is substantially more effective than decoder-style latent generation in codec space. Qualitative diagnostics show that neural audio codec latents retain source-dependent structure, which CodecSep exploits mainly through channel-wise source-conditioned modulation. CodecSep also provides a practical code-stream deployment path. When audio is transmitted as neural audio codec codes, CodecSep maps codes to embeddings, separates directly in codec space, and outputs waveforms or re-quantized codes, avoiding the decode-separate-re-encode loop. In this regime, CodecSep requires only 1.35 GMACs end-to-end: about 54 times less compute than AudioSep in the same pipeline and 25 times lower separator-only compute, with much lower latency and memory. More broadly, CodecSep offers a blueprint for codec-native downstream audio processing.

Adhiraj Banerjee, Vipul Arora • 2025
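The abstract describes channel-wise, prompt-conditioned modulation (FiLM) followed by explicit masking of the codec latent. The toy sketch below illustrates only that general idea and is not the authors' implementation. The function names, the tiny dimensions, the identity-like projection weights, and the sigmoid mask are all assumptions made for illustration. In the real system the text embedding would come from CLAP, the latent from the frozen DAC encoder, and the mask logits from the Transformer masker.

```python
import math

def linear(vec, weights, bias):
    """Plain linear projection: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film_modulate(latent, gamma, beta):
    """FiLM: scale and shift each channel of a (channels x frames) latent."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(latent, gamma, beta)]

def apply_mask(latent, mask_logits):
    """Explicit latent masking: multiply the latent by a mask in (0, 1).
    The sigmoid squashing is an assumption, not confirmed by the source."""
    return [[z / (1.0 + math.exp(-m)) for z, m in zip(z_row, m_row)]
            for z_row, m_row in zip(latent, mask_logits)]

# Hypothetical stand-in for a text-prompt embedding (2 dimensions).
text_emb = [1.0, -1.0]

# Hypothetical projections from the prompt embedding to per-channel gamma and beta.
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_beta, b_beta = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]

# Hypothetical stand-in for a codec latent: 2 channels x 3 frames.
latent = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

gamma = linear(text_emb, w_gamma, b_gamma)
beta = linear(text_emb, w_beta, b_beta)

# In this toy, the modulated features double as the mask logits.
mask_logits = film_modulate(latent, gamma, beta)
separated = apply_mask(latent, mask_logits)
```

Here the prompt drives channel 0 toward being kept and channel 1 toward being suppressed, which is the sense in which the modulation is source-conditioned per channel.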

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio Separation | NVIDIA A-30 GPU Compute Profiling | Memory Footprint (MB) | 28.06 | 9 |
| Sound Separation | Efficiency Benchmarking Suite (inference) | GMACs (Inference) | 1.35 | 9 |
| Sound Separation | NVIDIA Sound Separation Efficiency Benchmarking A-30 | Full Memory Footprint (MB) | 76.46 | 9 |
| Sound Separation | NVIDIA A-30 GPU Compute Benchmark | Parameter-only Memory Footprint (MB) | 48.4 | 9 |
| Sound Separation | Sound Separation | Parameters (M) | 16.3 | 9 |
| Universal Sound Separation | dnr v2 (test) | Music SI-SDR | 1.2 | 8 |
| Source Separation | dnr v2 (test) | Overall MOS-LQS Score | 3.34 | 2 |
| Sound Separation | ESC-50 | SI-SDR | 5.9 | 2 |
| Sound Separation | Clotho V2 | SI-SDR | 6 | 2 |
| Sound Separation | AudioSet | SI-SDR | 6.4 | 2 |
Showing 10 of 12 rows
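Several rows above report SI-SDR (scale-invariant signal-to-distortion ratio, in dB). As a reminder of what that number measures, here is a minimal sketch of the standard definition: project the estimate onto the reference, then compare the energy of the projection with the energy of the residual. The function name and the `eps` stabilizer are illustrative choices. The sketch assumes both signals are already zero-mean and equal in length, and it is not the evaluation code used for these benchmarks.

```python
import math

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length, zero-mean signals."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference that best explains the estimate.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))

# Toy signals: the estimate is the reference plus a small orthogonal error.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]

score = si_sdr(estimate, reference)
# Rescaling the estimate leaves the score unchanged (scale invariance).
score_scaled = si_sdr([2.0 * x for x in estimate], reference)
```

In this toy case the target-to-residual energy ratio is 100, so the score is about 20 dB, and doubling the estimate's amplitude does not change it.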
