Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

About

Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
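To give a feel for what a Wasserstein penalty on a latent space can look like, here is a minimal sketch. It is not taken from the UniVoiceLite code, and the abstract does not say which form of the distance the authors use. It assumes, purely for illustration, a diagonal-Gaussian latent with mean `mu` and standard deviation `sigma` and a standard normal prior, for which the squared 2-Wasserstein distance has the closed form ||mu||^2 + ||sigma - 1||^2. The function name is hypothetical.

```python
def w2_squared_to_standard_normal(mu, sigma):
    """Squared 2-Wasserstein distance between N(mu, diag(sigma^2)) and N(0, I).

    Assumption: the latent is a diagonal Gaussian. This is an illustrative
    choice, not the formulation confirmed by the paper.
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    if any(s < 0 for s in sigma):
        raise ValueError("standard deviations must be non-negative")
    mean_term = sum(m * m for m in mu)               # ||mu - 0||^2
    scale_term = sum((s - 1.0) ** 2 for s in sigma)  # ||sigma - 1||^2
    return mean_term + scale_term


# A latent that already matches the prior incurs no penalty.
penalty_at_prior = w2_squared_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# A latent whose mean has drifted away from the prior is penalized.
penalty_shifted = w2_squared_to_standard_normal([3.0, 4.0], [1.0, 1.0])
```

In a setup like this, the penalty would typically be added to a reconstruction loss with a weighting factor, so no paired noisy-clean targets are needed to compute it.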

Jisoo Park, Seonghak Lee, Guisik Kim, Taewoo Kim, Junseok Kwon • 2025

Related benchmarks

Task | Dataset | Result | Rank
Speech Enhancement | GRID and DEMAND Station noise (test) | SDR 2.11 | 6
Speech Enhancement | GRID and DEMAND Kitchen noise (test) | SDR 3.1 | 6
Speech Enhancement | GRID and DEMAND Metro noise (test) | SDR 1.01 | 6
Speech Enhancement | GRID and DEMAND Cafeteria noise (test) | SDR 1.17 | 6
Speech Separation | GRID (test) | SDR 1.46 | 5
Speech Separation (2-speaker) | GRID (test) | SDR 1.46 | 4
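The benchmark results are reported as SDR (signal-to-distortion ratio). As a rough illustration of the metric, the sketch below computes the basic energy-ratio form, 10 * log10(||s||^2 / ||s - s_hat||^2), in dB. This is an assumption about the definition: the page does not state which SDR variant or evaluation toolkit the benchmarks use, and variants such as BSS Eval SDR or SI-SDR can give different numbers. The function name `sdr_db` is hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Basic signal-to-distortion ratio in dB.

    Assumption: plain energy ratio between the clean reference and the
    residual error, with no scaling or filtering allowed on the estimate.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy signals: the estimate is the clean reference attenuated by 10%.
clean = [1.0, 0.0, -1.0, 0.0]
estimated = [0.9, 0.0, -0.9, 0.0]
score = sdr_db(clean, estimated)  # about 20 dB
```

Higher values mean less distortion. An all-zero estimate scores 0 dB under this definition, since the error energy then equals the signal energy.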
