
Intelligible Lip-to-Speech Synthesis with Speech Units

About

In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing intelligible speech from a silent lip movement video. Specifically, to complement the insufficient supervisory signal of previous L2S models, we propose to use quantized self-supervised speech representations, called speech units, as an additional prediction target for the L2S model. The proposed L2S model is thus trained to generate multiple targets: a mel-spectrogram and speech units. As the speech units are discrete while the mel-spectrogram is continuous, the proposed multi-target L2S model can be trained with strong content supervision, without using text-labeled data. Moreover, to accurately convert the synthesized mel-spectrogram into a waveform, we introduce a multi-input vocoder that can generate a clear waveform even from a blurry and noisy mel-spectrogram by referring to the speech units. Extensive experimental results confirm the effectiveness of the proposed method in L2S.

Jeongsoo Choi, Minsu Kim, Yong Man Ro • 2023
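
The abstract describes a multi-target setup: the model predicts both a continuous mel-spectrogram and discrete speech units from lip-video features. The sketch below illustrates that idea only; it is not the authors' code, and the names (MultiTargetL2SHead, multi_target_loss, the feature and unit dimensions) are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTargetL2SHead(nn.Module):
    """Predicts a mel-spectrogram and discrete speech units from per-frame
    visual features extracted from a silent lip video (illustrative sketch)."""

    def __init__(self, feat_dim=512, n_mels=80, n_units=200):
        super().__init__()
        # Continuous branch: regresses the mel-spectrogram (L1 loss below).
        self.mel_head = nn.Linear(feat_dim, n_mels)
        # Discrete branch: classifies the quantized speech unit per frame;
        # cross-entropy here provides the strong content supervision.
        self.unit_head = nn.Linear(feat_dim, n_units)

    def forward(self, visual_feats):
        # visual_feats: (batch, time, feat_dim)
        mel = self.mel_head(visual_feats)            # (batch, time, n_mels)
        unit_logits = self.unit_head(visual_feats)   # (batch, time, n_units)
        return mel, unit_logits


def multi_target_loss(mel_pred, unit_logits, mel_target, unit_target, unit_weight=1.0):
    """Combined objective: mel reconstruction plus speech-unit classification."""
    mel_loss = F.l1_loss(mel_pred, mel_target)
    unit_loss = F.cross_entropy(
        unit_logits.transpose(1, 2),  # (batch, n_units, time) for per-frame CE
        unit_target,                  # (batch, time) integer unit indices
    )
    return mel_loss + unit_weight * unit_loss


if __name__ == "__main__":
    B, T, D = 2, 50, 512
    head = MultiTargetL2SHead(feat_dim=D)
    feats = torch.randn(B, T, D)                  # stand-in for lip-video features
    mel_pred, unit_logits = head(feats)
    mel_target = torch.randn(B, T, 80)            # ground-truth mel-spectrogram
    unit_target = torch.randint(0, 200, (B, T))   # units from a quantized SSL model
    loss = multi_target_loss(mel_pred, unit_logits, mel_target, unit_target)
    loss.backward()
    print(f"combined loss: {loss.item():.3f}")

In this reading, the discrete unit targets would come from quantizing a self-supervised speech model's representations of the ground-truth audio, while the predicted mel-spectrogram and units would both feed the multi-input vocoder at synthesis time.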

Related benchmarks

Task                       Dataset          Result        Rank
Video-to-Speech Synthesis  LRS3-TED (test)  UTMOS 2.702   7
Video-to-Speech Synthesis  LRS2-BBC (test)  UTMOS 2.331   7
Lip-to-Speech Synthesis    LRS3-TED (test)  UTMOS 2.7019  7
Lip-to-Speech Synthesis    LRS2-BBC (test)  UTMOS 2.3315  7
