Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion

About

Editing signals using large pre-trained models, in a zero-shot manner, has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, which use DDPM inversion with pre-trained diffusion models. The first, which we coin ZEro-shot Text-based Audio (ZETA) editing, is adopted from the image domain. The second, named ZEro-shot UnSupervized (ZEUS) editing, is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody. Samples and code are available at https://hilamanor.github.io/AudioEditing/.
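To make the term "DDPM inversion" concrete, the following is a minimal NumPy sketch of one common formulation, the "edit-friendly" inversion from the image domain, which we assume is the kind of inversion the abstract refers to. It is not the authors' code. The noise predictor `toy_eps_model` is a hypothetical stand-in for a pre-trained diffusion model, the linear beta schedule and the variance choice sigma_t^2 = beta_t are assumptions, and all function names are illustrative. The sketch extracts a noise map for every timestep from a given signal and shows that replaying those noise maps through the sampler reconstructs the signal.

```python
import numpy as np

def make_schedule(T=50, beta_start=1e-4, beta_end=0.02):
    # Assumed linear beta schedule; a real model ships its own schedule.
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_eps_model(x, t, T):
    # Hypothetical stand-in for a pre-trained noise-prediction network.
    return np.tanh(x) * (t / T)

def posterior_mean(x_t, t, betas, alphas, alpha_bars):
    # Standard DDPM reverse-step mean; t is 1-indexed, arrays are 0-indexed.
    i = t - 1
    eps = toy_eps_model(x_t, t, len(betas))
    return (x_t - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])

def ddpm_invert(x0, betas, alphas, alpha_bars, rng):
    T = len(betas)
    # Noise x0 independently at every timestep (not along one forward chain).
    xs = [x0]
    for t in range(1, T + 1):
        i = t - 1
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(alpha_bars[i]) * x0 + np.sqrt(1.0 - alpha_bars[i]) * noise)
    # Solve the reverse-step equation x_{t-1} = mu_t(x_t) + sigma_t * z_t for z_t.
    zs = {}
    for t in range(T, 0, -1):
        mu = posterior_mean(xs[t], t, betas, alphas, alpha_bars)
        zs[t] = (xs[t - 1] - mu) / np.sqrt(betas[t - 1])
    return xs[T], zs

def ddpm_generate(x_T, zs, betas, alphas, alpha_bars):
    # Run the sampler with the extracted noise maps instead of fresh noise.
    x = x_T
    for t in range(len(betas), 0, -1):
        x = posterior_mean(x, t, betas, alphas, alpha_bars) + np.sqrt(betas[t - 1]) * zs[t]
    return x

# Toy 1-D "audio" signal: invert it, then regenerate it from the noise maps.
rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0.0, 8.0 * np.pi, 64))
betas, alphas, alpha_bars = make_schedule()
x_T, zs = ddpm_invert(x0, betas, alphas, alpha_bars, rng)
recon = ddpm_generate(x_T, zs, betas, alphas, alpha_bars)
```

With the same model in both passes, `recon` matches `x0` up to floating-point error. Under this formulation, an edit would come from keeping `x_T` and `zs` fixed while changing what the model is conditioned on during generation (for example, the text prompt); the toy model here has no conditioning input, so that step is not shown.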

Hila Manor, Tomer Michaeli • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Audio Editing	AudioCaps	FD (Fréchet Distance)	57.27	24
Audio Event Removal	Event-level Editing Benchmark	CLAP	41.75	8
Music Editing	Music Editing Benchmark	CLAP	38.93	8
Audio Event Addition	Event-level Editing Benchmark	CLAP Score	47.28	8
Audio Event Replacement	Event-level Editing Benchmark	CLAP Score	44.94	8
Timbre Transfer	MUSDB18 HQ (test)	CLAP	0.283	8
Timbre Transfer	MusicDelta	CLAP	0.351	8
Audio Edit	Audio Edit (test)	Feature Distance (FD)	3.81	6
Audio Editing	Audio Editing Add	CLAP Score	36.8	6
Audio Editing	Audio Editing Replace	CLAP Score	0.378	6

Showing 10 of 28 rows
