FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation

About

Versatile audio super-resolution (SR) is the challenging task of restoring high-frequency components from low-resolution audio with sampling rates between 4 kHz and 32 kHz in various domains such as music, speech, and sound effects. Previous diffusion-based SR methods suffer from slow inference due to the need for a large number of sampling steps. In this paper, we introduce FlashSR, a single-step diffusion model for versatile audio super-resolution aimed at producing 48 kHz audio. FlashSR achieves fast inference by utilizing diffusion distillation with three objectives: distillation loss, adversarial loss, and distribution-matching distillation loss. We further enhance performance by proposing the SR Vocoder, which is specifically designed for SR models operating on mel-spectrograms. FlashSR demonstrates competitive performance with the current state-of-the-art model in both objective and subjective evaluations while being approximately 22 times faster.

Jaekwon Im, Juhan Nam • 2025

Related benchmarks

Task                                  | Dataset                       | Result      | Rank
Audio Super-Resolution                | VCTK In-domain                | LSD 0.96    | 34
Audio Super-Resolution                | MUSDB18-HQ Out-of-domain      | LSD 1.19    | 16
Audio Super-Resolution                | Internal Music In-domain      | LSD 1.14    | 16
Audio Super-Resolution                | ESC-50 Out-of-domain          | LSD 1.54    | 16
Audio Super-Resolution                | VCTK (test)                   | LSD 3       | 7
Audio Super-Resolution                | ESC-50 (test)                 | MOS 3.76    | 6
Audio Super-Resolution                | Internal Music (test)         | MOS 3.78    | 6
Audio Super-Resolution                | MUSDB18 HQ (test)             | MOS 3.95    | 6
Binary real/fake audio classification | VCTK 16 to 48 kHz ADSR (test) | Accuracy 85 | 5
Audio Super-Resolution                | FMA small (test)              | LSD 3.6     | 4
Showing 10 of 12 rows
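
Most rows above report LSD (log-spectral distance), where lower is better. As a rough illustration of how such a score is commonly computed, the following is a minimal sketch, not the evaluation code behind these numbers. The function names, the FFT size, the hop length, and the choice of log10 on the power spectrum are all assumptions; the STFT settings and log convention used for the benchmark results are not stated on this page.

```python
import numpy as np


def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram from a Hann-windowed framed FFT, shape (frames, bins).

    n_fft and hop are illustrative defaults (assumptions), and the signal
    must contain at least n_fft samples.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def log_spectral_distance(reference, estimate, eps=1e-10):
    """One common LSD variant: RMS log-power difference over frequency,
    averaged over frames. eps guards against log of zero."""
    ref = np.log10(stft_power(reference) + eps)
    est = np.log10(stft_power(estimate) + eps)
    per_frame = np.sqrt(np.mean((ref - est) ** 2, axis=1))
    return float(np.mean(per_frame))
```

Under this variant, two identical signals score 0, and the score grows as the estimate's spectrum departs from the reference, for example when high-frequency content is missing. Both inputs are assumed to be time-aligned, equal-length arrays at the same sampling rate.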
