DDTSE: Discriminative Diffusion Model for Target Speech Extraction

About

Diffusion models have gained attention in speech enhancement tasks, providing an alternative to conventional discriminative methods. However, target speech extraction under multi-speaker noisy conditions remains relatively unexplored. Moreover, the superior quality of diffusion methods typically comes at the cost of slower inference. In this paper, we introduce the Discriminative Diffusion model for Target Speech Extraction (DDTSE). We apply the same forward process as diffusion models and use a reconstruction loss similar to that of discriminative methods. Furthermore, we devise a two-stage training strategy to emulate the inference process during model training. DDTSE not only works as a standalone system, but can also further improve the performance of discriminative models without additional retraining. Experimental results demonstrate that DDTSE not only achieves higher perceptual quality but also speeds up inference by 3 times compared to the conventional diffusion model.

Leying Zhang, Yao Qian, Linfeng Yu, Heming Wang, Hemin Yang, Long Zhou, Shujie Liu, Yanmin Qian • 2023

Related benchmarks

Task	Dataset	Result
Target Speaker Extraction	Libri2Mix Clean	DNSMOS OVL 3.83	14
Target Speaker Extraction	Libri2Mix Clean min 16 kHz	PESQ 1.85	9
Target Speaker Extraction	Libri2Mix Noisy min 16 kHz	PESQ 1.6	8
Target Speaker Extraction	Libri2Mix noisy	PESQ 1.6	7
Target Speech Extraction	Libri2Mix-360 2-speaker 16kHz (test)	SI-SDR 13.8	1
Target Speech Extraction	Libri2Mix 360 2-speaker+noise 16kHz (test)	SI-SDR 9.7	1
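
For readers unfamiliar with the SI-SDR metric reported in the rows above, here is a minimal sketch of the standard scale-invariant SDR definition in plain Python. The function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions made for illustration; the evaluation code used to produce the listed numbers is not shown on this page and may differ in such details.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better).

    Illustrative sketch only; not the evaluation script behind the table.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Assumption: remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything not explained by the scaled reference counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)

    return 10 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.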

