
DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization

About

Language-queried Audio Source Separation (LASS) enables open-vocabulary sound separation via natural language queries. While existing methods rely on task-specific training, we explore whether pretrained diffusion models, originally designed for audio generation, can inherently perform separation without further training. In this study, we introduce a training-free framework leveraging generative priors for zero-shot LASS. Analyzing naive adaptations, we identify key limitations arising from modality-specific challenges. To address these issues, we propose Diffusion-Guided Mask Optimization (DGMO), a test-time optimization framework that refines spectrogram masks for precise, input-aligned separation. Our approach effectively repurposes pretrained diffusion models for source separation, achieving competitive performance without task-specific supervision. This work expands the application of diffusion models beyond generation, establishing a new paradigm for zero-shot audio separation. The code is available at: https://wltschmrz.github.io/DGMO/

Geonyoung Lee, Geonhee Han, Paul Hongsuck Seo • 2025
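The core idea in the abstract — refining a spectrogram mask at test time so the masked mixture matches a reference produced by a pretrained diffusion model — can be sketched as below. This is a minimal illustration, not the authors' implementation: the diffusion model is replaced by a fixed reference magnitude spectrogram, and the sigmoid parameterization, squared-error objective, and plain gradient descent are assumptions made for the sake of a self-contained example.

```python
import numpy as np

def optimize_mask(mixture_mag, reference_mag, steps=500, lr=1.0):
    """Optimize a soft [0, 1] spectrogram mask so that mask * mixture
    approaches a diffusion-generated reference magnitude spectrogram
    (stubbed here as a fixed array). Hypothetical sketch of the
    test-time mask-optimization idea, not the paper's exact method."""
    logits = np.zeros_like(mixture_mag)  # unconstrained mask parameters
    for _ in range(steps):
        mask = 1.0 / (1.0 + np.exp(-logits))   # sigmoid keeps mask in [0, 1]
        est = mask * mixture_mag               # masked-mixture estimate
        # Gradient of the summed squared error w.r.t. the mask logits,
        # via the chain rule through the sigmoid.
        grad_est = 2.0 * (est - reference_mag)
        grad_logits = grad_est * mixture_mag * mask * (1.0 - mask)
        logits -= lr * grad_logits
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example: the "target" source occupies the low-frequency half
# of the mixture, an interfering source the high-frequency half.
target = np.zeros((64, 32))
target[:32] = 1.0
interference = np.zeros((64, 32))
interference[32:] = 1.0
mixture = target + interference

mask = optimize_mask(mixture, target)
separated = mask * mixture
```

The optimized mask converges toward 1 on target-dominated bins and toward 0 elsewhere; in the actual framework the reference would instead be synthesized by the pretrained text-to-audio diffusion model conditioned on the language query.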

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Target Speaker Extraction | PORTE (0% overlap) | SISDRi | -2.92 | 10 |
| Target Speaker Extraction | PORTE Avg. | SISDRi | -4.56 | 10 |
| Target Speaker Extraction | PORTE (20% overlap) | SISDRi | -4.29 | 10 |
| Target Speaker Extraction | PORTE (40% overlap) | SISDRi | -4.52 | 10 |
| Target Speaker Extraction | PORTE (60% overlap) | SISDRi | -5.06 | 10 |
| Target Speaker Extraction | PORTE (80% overlap) | SISDRi | -5.30 | 10 |
| Target Speaker Extraction | PORTE (100% overlap) | SISDRi | -5.25 | 10 |
| Target Speaker Extraction | PORTE Avg. | SuRE | 0.36 | 5 |
