MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model

About

Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at https://hieugiaosu.github.io/MAGE/.
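As a rough illustration of the two ideas in the abstract, the toy sketch below orders token positions so that frequent tokens are handled in early steps and rare tokens in later ones, and flags low-confidence predictions for re-masking. It is one possible reading of the description, not the authors' implementation: the function names `coarse_to_fine_steps` and `remask_low_confidence`, the use of corpus counts as the scarcity measure, the even split across steps, and the fixed confidence threshold are all assumptions made for this example.

```python
from collections import Counter


def coarse_to_fine_steps(tokens, corpus_counts, num_steps):
    """Assign each position to a step: frequent tokens first, rare tokens last.

    Hypothetical helper; the scarcity measure (raw corpus counts) is an assumption.
    """
    # Sort positions by descending corpus frequency; ties are broken by position.
    order = sorted(range(len(tokens)), key=lambda i: (-corpus_counts[tokens[i]], i))
    steps = [[] for _ in range(num_steps)]
    for rank, pos in enumerate(order):
        # Spread the ranked positions evenly over the available steps.
        steps[rank * num_steps // len(tokens)].append(pos)
    return steps


def remask_low_confidence(confidences, threshold):
    """Return the positions whose confidence is below the threshold.

    Hypothetical stand-in for a corrector; a fixed threshold is an assumption.
    """
    return [i for i, c in enumerate(confidences) if c < threshold]


# Toy data: token 7 is common, token 3 less so, token 9 is rare.
corpus_counts = Counter({7: 100, 3: 40, 9: 2})
tokens = [7, 9, 3, 7]

steps = coarse_to_fine_steps(tokens, corpus_counts, num_steps=2)
remask = remask_low_confidence([0.9, 0.2, 0.6, 0.95], threshold=0.5)

print(steps)   # positions grouped by step, frequent tokens first
print(remask)  # positions to re-mask and predict again
```

In this toy run, the two occurrences of the common token land in the first step and the rare token in the last, while the single prediction with confidence below 0.5 is sent back for refinement.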

The Hieu Pham, Tan Dat Nguyen, Phuong Thanh Tran, Joon Son Chung, Duc Dung Nguyen • 2025

Related benchmarks

Task	Dataset	Result
Speech Enhancement	DNS Challenge Real Recordings (test)	SIG Score 4.206	41
Speech Enhancement	DNS Challenge With Reverb (test)	SIG 3.876	24
Automatic Speech Recognition	LibriSpeech noisy (test)	WER 0.2345	5
Speech Enhancement	LibriSpeech noisy (test)	SIG Score 4.517	5
