Intelligible Lip-to-Speech Synthesis with Speech Units
About
In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing intelligible speech from a silent lip-movement video. Specifically, to complement the insufficient supervisory signal of previous L2S models, we propose to use quantized self-supervised speech representations, named speech units, as an additional prediction target for the L2S model. The proposed L2S model is therefore trained to generate multiple targets: a mel-spectrogram and speech units. As the speech units are discrete while the mel-spectrogram is continuous, the proposed multi-target L2S model can be trained with strong content supervision, without using text-labeled data. Moreover, to accurately convert the synthesized mel-spectrogram into a waveform, we introduce a multi-input vocoder that can generate a clear waveform even from a blurry and noisy mel-spectrogram by referring to the speech units. Extensive experimental results confirm the effectiveness of the proposed method in L2S.
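The following is a minimal sketch, not the authors' implementation, of the multi-target idea described above: a shared video encoder feeds two heads, a continuous mel-spectrogram regressor and a discrete speech-unit classifier, and their losses are summed. All module choices, dimensions, and the unit vocabulary size (e.g., k-means clusters over self-supervised speech features) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTargetL2S(nn.Module):
    def __init__(self, feat_dim=512, n_mels=80, n_units=200):
        super().__init__()
        # Stand-in for the lip-video encoder (the paper's actual
        # architecture is not reproduced here).
        self.encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.mel_head = nn.Linear(feat_dim, n_mels)    # continuous target
        self.unit_head = nn.Linear(feat_dim, n_units)  # discrete target

    def forward(self, video_feats):
        # video_feats: (B, T, feat_dim) frame-level visual features
        h, _ = self.encoder(video_feats)
        return self.mel_head(h), self.unit_head(h)

def multi_target_loss(mel_pred, unit_logits, mel_gt, unit_gt, w_unit=1.0):
    # Regression on the mel-spectrogram plus cross-entropy on the
    # discrete speech units: the latter supplies the strong, text-free
    # content supervision described in the abstract.
    l_mel = F.l1_loss(mel_pred, mel_gt)
    # cross_entropy expects (B, C, T) logits against (B, T) targets
    l_unit = F.cross_entropy(unit_logits.transpose(1, 2), unit_gt)
    return l_mel + w_unit * l_unit
```

The `w_unit` weight balancing the two losses is a hypothetical hyperparameter; the point of the sketch is only that the discrete unit target is trained jointly with the continuous mel target.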
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video-to-Speech Synthesis | LRS3-TED (test) | UTMOS | 2.702 | 7 |
| Video-to-Speech Synthesis | LRS2-BBC (test) | UTMOS | 2.331 | 7 |
| Lip-to-Speech Synthesis | LRS3-TED (test) | UTMOS | 2.7019 | 7 |
| Lip-to-Speech Synthesis | LRS2-BBC (test) | UTMOS | 2.3315 | 7 |