FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching
About
Audio super-resolution is challenging owing to its ill-posed nature. Recently, diffusion models have shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the need for numerous sampling steps, which significantly increases latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions and thereby enhance reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. Experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance (LSD) and ViSQOL, while maintaining computational efficiency with only a single sampling step.
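To make "single-step sampling" concrete, the following is a minimal sketch of fixed-step Euler integration of a flow-matching ODE, dx/dt = v(x, t), from t = 0 to t = 1. It is not the FLowHigh implementation: the function `euler_sample`, the toy `oracle_field`, and the straight-line path x_t = (1 - t) x_start + t x_target are all assumptions chosen for illustration, and the paper's tailored probability paths are not reproduced here. In practice the vector field would be a trained network rather than an oracle.

```python
import numpy as np


def euler_sample(vector_field, x0, num_steps=1):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * vector_field(x, t)
    return x


# Toy setup (hypothetical): two random vectors stand in for the starting
# point and the high-resolution target.
rng = np.random.default_rng(0)
x_start = rng.standard_normal(8)
x_target = rng.standard_normal(8)


def oracle_field(x, t):
    # Velocity of the straight-line path x_t = (1 - t) * x_start + t * x_target.
    # It is constant in t, so a single Euler step is already exact.
    return x_target - x_start


x_one_step = euler_sample(oracle_field, x_start, num_steps=1)
x_four_steps = euler_sample(oracle_field, x_start, num_steps=4)
```

With a straight path the velocity does not change along the trajectory, which is why one step and four steps land on the same point in this toy case. A learned vector field is only approximately straight, so the quality of a single-step sample depends on how well training achieves that.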
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio Super-Resolution | VCTK In-domain | LSD 1.17 | 34 |
| Audio Super-Resolution | ESC-50 Out-of-domain | LSD 1.63 | 16 |
| Audio Super-Resolution | Internal Music In-domain | LSD 1.43 | 16 |
| Audio Super-Resolution | MUSDB18-HQ Out-of-domain | LSD 1.77 | 16 |
| Bandwidth extension | VCTK 8 kHz to 44.1 kHz (test) | ViSQOL 3.49 | 10 |
| Bandwidth extension | TIMIT 8 kHz to 16 kHz (test) | ViSQOL 2.59 | 10 |
| Audio Super-Resolution | VCTK (test) | LSD 3.9 | 7 |
| Audio Super-Resolution | ESC-50 (test) | MOS 3.18 | 6 |
| Audio Super-Resolution | Internal Music (test) | MOS 3.12 | 6 |
| Audio Super-Resolution | MUSDB18-HQ (test) | MOS 3.11 | 6 |
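
Several of the results above are reported as log-spectral distance (LSD). As a rough guide to what that number measures, here is a minimal NumPy sketch using a commonly used definition: the root-mean-square difference of log10 power spectra across frequency bins, averaged over frames. The helper names, the Hann window, and the STFT settings (`n_fft=2048`, `hop=512`) are assumptions for illustration; the evaluation setup behind the listed figures is not specified on this page and may differ.

```python
import numpy as np


def power_spectrogram(signal, n_fft=2048, hop=512):
    """Return a (frames, bins) power spectrogram using a Hann window."""
    signal = np.asarray(signal, dtype=np.float64)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def log_spectral_distance(reference, estimate, n_fft=2048, hop=512, eps=1e-10):
    """Mean over frames of the RMS difference of log10 power spectra."""
    ref = np.log10(power_spectrogram(reference, n_fft, hop) + eps)
    est = np.log10(power_spectrogram(estimate, n_fft, hop) + eps)
    per_frame = np.sqrt(np.mean((ref - est) ** 2, axis=1))
    return float(np.mean(per_frame))


# Toy signals (hypothetical): a tone pair, and the same signal with the
# high-frequency component removed to mimic band-limited input.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
full_band = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)
band_limited = np.sin(2 * np.pi * 440 * t)

lsd_identical = log_spectral_distance(full_band, full_band)
lsd_degraded = log_spectral_distance(full_band, band_limited)
```

Identical signals give an LSD of zero, and removing high-frequency content increases it, so lower values indicate a closer spectral match to the reference.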