
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

About

In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.

Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak • 2024

Related benchmarks

Task                      Dataset       Result          Rank
Text-prompted separation  Instr pro     SAJ 2.65        11
Text-prompted separation  Speaker       SAJ 2.26        9
Text-prompted separation  Instr(wild)   SAJ 2.92        9
Text-prompted separation  Speech        SAJ 3.45        9
Text-prompted separation  music         SAJ 2.68        7
Text-prompted separation  General SFX   SAJ Score 3.29  5
