VoiceBridge: General Speech Restoration with One-step Latent Bridge Models
About
Bridge models have been investigated for speech enhancement, but most are single-task and have limited general speech restoration (GSR) capability. In this work, we propose VoiceBridge, a one-step latent bridge model (LBM) for GSR that efficiently reconstructs 48 kHz fullband speech from diverse distortions. To inherit the advantages of data-domain bridge models, we design an energy-preserving variational autoencoder that improves waveform-latent space alignment across varying energy levels. By compressing waveforms into continuous latent representations, VoiceBridge models various GSR tasks with a single latent-to-latent generative process backed by a scalable transformer. To alleviate the challenge of reconstructing a high-quality target from distinctly different low-quality priors, we propose a joint neural prior for GSR, which uniformly reduces the burden on the LBM across diverse tasks. Building on these designs, we further investigate the bridge training objective by jointly tuning the LBM, decoder, and discriminator, transforming the model from a denoiser into a generator and enabling one-step GSR without distillation. Extensive validation across in-domain tasks (e.g., denoising and super-resolution), out-of-domain tasks (e.g., refining synthesized speech), and datasets demonstrates the superior performance of VoiceBridge. Demos: https://VoiceBridgedemo.github.io/.
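The data flow described in the abstract (waveform to latent, joint neural prior, a single bridge step, then decoding) can be sketched roughly as follows. This is only an illustrative sketch: every function name here (`encode`, `neural_prior`, `one_step_bridge`, `decode`, `restore`) and the `hop` parameter are hypothetical placeholders, not the authors' API, and the bodies are trivial stand-ins for trained networks.

```python
import numpy as np

def encode(waveform, hop=4):
    """Hypothetical stand-in for the VAE encoder: averages non-overlapping frames."""
    n = len(waveform) // hop * hop          # drop samples that do not fill a frame
    return waveform[:n].reshape(-1, hop).mean(axis=1)

def neural_prior(latent):
    """Hypothetical stand-in for the joint neural prior (identity here)."""
    return latent

def one_step_bridge(prior_latent):
    """Hypothetical stand-in for the LBM: one function evaluation, no iterative sampling."""
    return prior_latent                     # a trained transformer would go here

def decode(latent, hop=4):
    """Hypothetical stand-in for the VAE decoder: repeats each latent value per frame."""
    return np.repeat(latent, hop)

def restore(waveform):
    """Assumed pipeline order: encode -> prior -> single bridge step -> decode."""
    z_low = encode(waveform)                # latent of the degraded input
    z_prior = neural_prior(z_low)           # prior latent handed to the bridge
    z_high = one_step_bridge(z_prior)       # single latent-to-latent step
    return decode(z_high)

if __name__ == "__main__":
    degraded = np.ones(18)
    print(restore(degraded).shape)
```

The point of the sketch is the single call to `one_step_bridge`: restoration happens in one latent-to-latent evaluation rather than in an iterative sampling loop.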
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| General Speech Restoration | DNS-Real Out-Domain (test) | SIG | 3.473 | 9 |
| Speech Denoising | WSJ0-CHiME3 (test) | PESQ | 1.74 | 8 |
| Bandwidth extension | VCTK-BWE BW=2K (test) | WVMOS | 4.306 | 7 |
| General Speech Restoration | Voicefixer-GSR In-Domain (test) | SIG | 3.494 | 7 |
| General Speech Restoration | DNS-with-Reverb Out-Domain (test) | SIG | 3.581 | 7 |
| Bandwidth extension | VCTK-BWE BW=4K (test) | WVMOS | 4.404 | 7 |
| Speech Enhancement | WSJ0-CHiME3 Out-Domain (test) | PESQ | 1.742 | 7 |
| Bandwidth extension | VCTK-BWE BW=1K (test) | WVMOS | 4.154 | 6 |
| Dereverberation | WSJ0-Reverb (test) | WVMOS | 4.403 | 6 |
| Speech Enhancement | VB-Demand In-Domain (test) | PESQ | 2.831 | 6 |