Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

About

Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio quality. To make our SR model robust against various kinds of degradation, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various kinds of audio degradation and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/

Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani • 2023

Related benchmarks

Task	Dataset	Result
General Speech Restoration	DNS-Real Out-Domain (test)	SIG 3.325	17
Speech Restoration	DNS Challenge real-recording 2020	DNSMOS Score 3.17	14
General Speech Restoration	Voicefixer-GSR In-Domain (test)	SIG 3.335	7
General Speech Restoration	DNS-with-Reverb Out-Domain (test)	SIG 3.401	7
