SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

About

This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in word error rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/
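For intuition, below is a minimal, self-contained sketch of score-based source separation: two diffusion priors (one per source) are sampled jointly under a mixture-consistency term, in the spirit of the inverse-sampling idea described above. The ToyScore networks, noise schedule, and guidance weight are illustrative placeholders, not the paper's models or its reformulated sampler, and visual conditioning is omitted entirely.

```python
import torch

# Toy stand-ins for pretrained diffusion score networks; the paper's actual
# priors are audio-visual diffusion models for speech and for ambient noise.
class ToyScore(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 64),
            torch.nn.SiLU(),
            torch.nn.Linear(64, dim),
        )

    def forward(self, x, sigma):
        # x: (batch, dim); sigma: scalar noise level broadcast to every sample.
        s = torch.full((x.shape[0], 1), float(sigma))
        return self.net(torch.cat([x, s], dim=-1))


def separate(y, speech_score, noise_score, steps=200, guidance=1.0):
    """Separate a single-channel mixture y = speech + noise.

    Annealed Langevin dynamics: at each noise level, each source estimate takes
    a step along its own prior score, plus a data-consistency step that pulls
    the sum of the estimates toward the observed mixture. This is a generic
    score-based separation sketch, not the paper's inverse sampler.
    """
    speech = torch.randn_like(y)
    noise = torch.randn_like(y)
    sigmas = torch.linspace(1.0, 1e-2, steps)
    for sigma in sigmas:
        eps = 0.5 * sigma**2  # step size shrinks with the noise level
        # Prior steps along each source's score.
        speech = speech + eps * speech_score(speech, sigma)
        noise = noise + eps * noise_score(noise, sigma)
        # Data-consistency step: gradient of 0.5 * ||speech + noise - y||^2.
        resid = (speech + noise) - y
        speech = speech - guidance * eps * resid
        noise = noise - guidance * eps * resid
        # Langevin noise injection.
        speech = speech + torch.sqrt(2.0 * eps) * torch.randn_like(speech)
        noise = noise + torch.sqrt(2.0 * eps) * torch.randn_like(noise)
    return speech, noise


if __name__ == "__main__":
    dim = 128
    y = torch.randn(4, dim)  # placeholder mixture features
    s_hat, n_hat = separate(y, ToyScore(dim), ToyScore(dim))
    print(s_hat.shape, n_hat.shape)
```

With trained priors in place of ToyScore, the same loop recovers both the speech and the noise estimate from a single mixture, which is why the separated noise component remains usable for downstream acoustic scene detection.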

Yochai Yemini, Yoav Ellinson, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya • 2026

Related benchmarks

Task | Dataset | Result (SI-SDR, dB) | Rank
Audio-visual speech separation | VoxCeleb2 (on-screen speakers) | 6.17 | 4
Audio-visual speech separation | VoxCeleb2 (off-screen speakers) | 4.28 | 4
Speech Separation | VoxCeleb2+DCASE, 3 speakers + noise (test) | 7.84 | 4
Speech Separation | VoxCeleb2, 2-speaker mixture, on-screen speaker (test) | 6.24 | 4
Speech Separation | VoxCeleb2, 2-speaker mixture, off-screen speaker (test) | 4.56 | 4
Speech Separation | VoxCeleb2+DCASE, 1 speaker + noise (test) | 9.14 | 4
Speech Separation | VoxCeleb2+DCASE, 2 speakers + noise (test) | 6.35 | 4
