FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation
About
Versatile audio super-resolution (SR) is the challenging task of restoring high-frequency components from low-resolution audio with sampling rates between 4 kHz and 32 kHz in various domains such as music, speech, and sound effects. Previous diffusion-based SR methods suffer from slow inference due to the need for a large number of sampling steps. In this paper, we introduce FlashSR, a single-step diffusion model for versatile audio super-resolution aimed at producing 48 kHz audio. FlashSR achieves fast inference by utilizing diffusion distillation with three objectives: distillation loss, adversarial loss, and distribution-matching distillation loss. We further enhance performance by proposing the SR Vocoder, which is specifically designed for SR models operating on mel-spectrograms. FlashSR demonstrates competitive performance with the current state-of-the-art model in both objective and subjective evaluations while being approximately 22 times faster.
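To make the task concrete, the sketch below works out which frequency band is absent from the input and has to be generated. It relies only on the Nyquist limit (a signal sampled at rate f carries content up to f/2). The function name `missing_band_hz` is hypothetical and is not taken from the FlashSR paper or code.

```python
def missing_band_hz(input_sr_hz, target_sr_hz=48_000):
    """Return (low, high) in Hz of the band an SR model must synthesize.

    Assumption: the input is band-limited at its Nyquist frequency
    (input_sr_hz / 2) and the output should cover content up to the
    target Nyquist frequency (target_sr_hz / 2).
    """
    if input_sr_hz >= target_sr_hz:
        raise ValueError("input sampling rate must be below the target rate")
    return input_sr_hz / 2, target_sr_hz / 2


# The two extremes of the input range quoted in the abstract.
for sr in (4_000, 32_000):
    low, high = missing_band_hz(sr)
    print(f"{sr} Hz input: generate {low:.0f} Hz to {high:.0f} Hz")
```

At the 4 kHz end of the range, almost the entire 24 kHz output bandwidth has to be generated, which is part of what makes the versatile setting hard.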
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Super-Resolution | VCTK In-domain | LSD | 0.96 | 34 |
| Audio Super-Resolution | MUSDB18-HQ Out-of-domain | LSD | 1.19 | 16 |
| Audio Super-Resolution | Internal Music In-domain | LSD | 1.14 | 16 |
| Audio Super-Resolution | ESC-50 Out-of-domain | LSD | 1.54 | 16 |
| Audio Super-Resolution | VCTK (test) | LSD | 3 | 7 |
| Audio Super-Resolution | ESC-50 (test) | MOS | 3.76 | 6 |
| Audio Super-Resolution | Internal Music (test) | MOS | 3.78 | 6 |
| Audio Super-Resolution | MUSDB18 HQ (test) | MOS | 3.95 | 6 |
| Binary real/fake audio classification | VCTK 16 to 48 kHz ADSR (test) | Accuracy | 85 | 5 |
| Audio Super-Resolution | FMA small (test) | LSD | 3.6 | 4 |
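Most rows above report LSD (log-spectral distance, lower is better). The following is a minimal NumPy sketch of one common formulation: the per-frame root-mean-square difference of log10 power spectra, averaged over frames. The STFT settings (`n_fft=2048`, `hop=512`, Hann window) and the `eps` floor are assumptions for illustration; the exact configuration behind the benchmark numbers is not stated here, so values from this sketch are not directly comparable to the table.

```python
import numpy as np


def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram of a 1-D signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def log_spectral_distance(reference, estimate, eps=1e-8):
    """Mean over frames of the RMS log10-power difference across frequency."""
    n = min(len(reference), len(estimate))  # compare the overlapping part only
    ref_pow = stft_power(reference[:n])
    est_pow = stft_power(estimate[:n])
    diff = np.log10(ref_pow + eps) - np.log10(est_pow + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))


# Usage: one second of noise at 48 kHz compared with a half-amplitude copy.
rng = np.random.default_rng(0)
reference = rng.standard_normal(48_000)
print(log_spectral_distance(reference, reference))
print(log_spectral_distance(reference, 0.5 * reference))
```

Because LSD compares log spectra, a uniform gain change alone produces a nonzero distance, so level alignment between reference and output matters when interpreting scores.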