
Real-Time Target Sound Extraction

About

We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
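
To illustrate why a stack of dilated causal convolutions covers a large receptive field cheaply, here is a minimal pure-Python sketch. It is not the Waveformer implementation: the kernel size, the dilation schedule, and the helper names (`causal_dilated_conv1d`, `receptive_field`, `run_stack`) are assumptions chosen only for illustration. Each output sample depends only on current and past inputs, which is what makes streaming inference possible.

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """Causal 1-D convolution: y[t] uses only x[t], x[t - d], x[t - 2d], ...

    Samples before the start of the signal are treated as zeros.
    """
    k_size = len(kernel)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(k_size):
            idx = t - (k_size - 1 - k) * dilation  # look back, never ahead
            if idx >= 0:
                acc += kernel[k] * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


def run_stack(x, kernel, dilations):
    """Apply one causal layer per dilation, reusing the same kernel (illustrative)."""
    for d in dilations:
        x = causal_dilated_conv1d(x, kernel, d)
    return x


if __name__ == "__main__":
    dilations = [1, 2, 4, 8]  # hypothetical schedule: doubles at every layer
    impulse = [1.0] + [0.0] * 63
    response = run_stack(impulse, [1.0, 1.0, 1.0], dilations)
    # Count how many output samples the single input impulse reaches.
    reached = sum(1 for v in response if v != 0.0)
    print(reached, receptive_field(3, dilations))
```

With exponentially growing dilations, the receptive field grows exponentially with depth while the number of multiply-adds per output sample grows only linearly, which is the efficiency argument made in the abstract.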

Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota • 2022

Related benchmarks

Task | Dataset | Result | Rank
Multi-target Sound Extraction | FSD Kaggle + TAU Urban Acoustic Scenes synthetic mixture 2018 2019 (test) | SI-SNRi (1 Class): 9.39 dB | 6
Single-target sound extraction | FSD Kaggle 2018 and TAU Urban Acoustic Scenes 2019 (test) | SI-SNRi: 9.43 dB | 6
Target Sound Extraction | testset (test) | SI-SNRi: 11.31 dB | 3
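
The results above are reported as SI-SNRi (scale-invariant signal-to-noise ratio improvement). As a rough guide to what the number means, the sketch below follows the commonly used definition: the SI-SNR of the model output minus the SI-SNR of the unprocessed mixture, both measured against the clean target. It is an assumption that the benchmarks use exactly this formulation; the function names and the synthetic sine-wave signals are hypothetical and are not taken from the paper's evaluation code.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [v - mean_e for v in estimate]  # remove DC offset
    t = [v - mean_t for v in target]
    # Project the estimate onto the target to get the "signal" component.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    signal = [scale * b for b in t]
    noise = [a - s for a, s in zip(e, signal)]
    signal_energy = sum(s * s for s in signal) + eps
    noise_energy = sum(v * v for v in noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)


if __name__ == "__main__":
    n = 100
    # Synthetic stand-ins: a target tone and an interfering tone.
    target = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
    interference = [math.sin(2 * math.pi * 12 * i / n) for i in range(n)]
    mixture = [t + v for t, v in zip(target, interference)]
    # Pretend a model reduced the interference amplitude by a factor of 10.
    estimate = [t + 0.1 * v for t, v in zip(target, interference)]
    print(round(si_snr_improvement(estimate, mixture, target), 2))
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the model output does not change the score. In this toy setup, attenuating the interference amplitude by a factor of 10 corresponds to an improvement of about 20 dB.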
