Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
About
Editing signals using large pre-trained models, in a zero-shot manner, has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, which use DDPM inversion with pre-trained diffusion models. The first, which we coin ZEro-shot Text-based Audio (ZETA) editing, is adopted from the image domain. The second, named ZEro-shot UnSupervized (ZEUS) editing, is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody. Samples and code can be found at https://hilamanor.github.io/AudioEditing/.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Editing | AudioCaps | FD (Fréchet Distance) | 57.27 | 24 |
| Timbre Transfer | MUSDB18 HQ (test) | CLAP | 0.283 | 8 |
| Timbre Transfer | MusicDelta | CLAP | 0.351 | 8 |
| Audio Edit | Audio Edit (test) | Feature Distance (FD) | 3.81 | 6 |
| Music Editing | Music Editing Subjective (evaluation) | Target Attribute Match (T) | 3.16 | 6 |
| Music Editing | ZoME-Bench Instrument | CLAP | 26.1 | 6 |
| Music Editing | ZoME-Bench Genre | CLAP | 27.3 | 6 |
| Audio Target Object Removal | SAVEBench 1.0 (test) | FAD | 2.69 | 4 |
| Audio Target Object Removal | SAVEBENCH | FAD | 2.69 | 4 |
| Audio Editing | AudioEdit | Overlap Score (OVL) | 74.6 | 3 |