Metis: A Foundation Speech Generation Model with Masked Generative Pre-training

About

We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, 1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. 2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. 3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are available at https://metis-demo.github.io/.

Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, Zhizheng Wu • 2025
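
As a rough illustration of the masked generative pre-training idea described in the abstract, the sketch below masks a random subset of a discrete token sequence, which is the kind of corrupted input a model would learn to reconstruct. The mask token id, the cosine ratio schedule, and the function names are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token (assumption, not from the paper)


def sample_mask_ratio(rng):
    """Draw a mask ratio in (0, 1] from a cosine schedule.

    A cosine schedule is a common choice in masked generative models;
    using it here is an assumption.
    """
    u = rng.random()  # uniform in [0, 1)
    return math.cos(0.5 * math.pi * u)


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked sequence and the sorted masked positions.
    A training loss would be computed only at those positions.
    """
    n = len(tokens)
    k = max(1, min(n, round(ratio * n)))  # mask at least one token
    positions = sorted(rng.sample(range(n), k))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions


rng = random.Random(0)
ssl_tokens = [5, 9, 2, 7, 7, 1, 3, 8]  # toy stand-in for SSL token ids
masked, positions = mask_tokens(ssl_tokens, 0.5, rng)
```

In this toy setup, a model would receive `masked` as input and be trained to predict the original ids at `positions`; during fine-tuning, task-specific conditions would be supplied alongside the masked tokens.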

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 5.1 | 1156 |
| Automatic Speech Recognition | LibriSpeech clean Speech Noise - Additive (test) | WER 9.4 | 28 |
| Automatic Speech Recognition | LibriSpeech other Speech Noise - Additive (test) | WER 18 | 28 |
| Automatic Speech Recognition | LibriSpeech Clean other (test) | WER 12.2 | 28 |
| Automatic Speech Recognition | LibriSpeech other Speech Noise - Reverb (test) | WER 50.6 | 28 |
| Automatic Speech Recognition | LibriSpeech clean Speech Noise - Reverb (test) | WER 44.1 | 28 |
| Voice Conversion | VCTK | WER 4.49 | 21 |
| General Speech Restoration | DNS-Real Out-Domain (test) | SIG 3.59 | 17 |
| Target Speaker Extraction | Libri2Mix Clean (test) | DNSMOS SIG 3.588 | 9 |
| Target Speaker Extraction | Libri2Mix Single Speaker (test) | WER 5.1 | 5 |
Showing 10 of 12 rows
