The Conversation: Deep Audio-Visual Speech Enhancement
About
Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during training, and to unconstrained environments. We demonstrate strong quantitative and qualitative results, isolating extremely challenging real-world examples.
Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman • 2018
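
At a high level, the network consumes the noisy spectrogram together with per-frame lip features and predicts a magnitude mask plus a phase correction for the target speaker. Below is a minimal PyTorch sketch of this magnitude-plus-phase idea; the module names, layer sizes, and GRU-based fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of magnitude + phase prediction conditioned on lip features.
# Assumes lip features are already aligned to the spectrogram frame rate.
import torch
import torch.nn as nn

class AVEnhancer(nn.Module):
    def __init__(self, freq_bins=257, lip_dim=512, hidden=256):
        super().__init__()
        # fuse each noisy magnitude frame with the matching lip-feature frame
        self.fuse = nn.Linear(freq_bins + lip_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.mag_head = nn.Linear(hidden, freq_bins)    # magnitude mask
        self.phase_head = nn.Linear(hidden, freq_bins)  # phase residual

    def forward(self, noisy_mag, noisy_phase, lip_feats):
        # noisy_mag, noisy_phase: (batch, time, freq); lip_feats: (batch, time, lip_dim)
        x = torch.cat([noisy_mag, lip_feats], dim=-1)
        h, _ = self.rnn(torch.relu(self.fuse(x)))
        mag = noisy_mag * torch.sigmoid(self.mag_head(h))     # masked magnitude
        phase = noisy_phase + torch.tanh(self.phase_head(h))  # corrected phase
        return mag, phase

model = AVEnhancer()
mag, phase = model(torch.rand(1, 100, 257),
                   torch.rand(1, 100, 257),
                   torch.rand(1, 100, 512))
est_spec = mag * torch.exp(1j * phase)  # complex spectrogram; invert via iSTFT
```

Predicting the phase as a residual on the noisy phase, rather than from scratch, reflects the intuition that the mixture phase is already a reasonable starting point for the target signal.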
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio-visual speech separation | LRS3 (test) | SDRi | 11.23 | 20 |
| Speech Enhancement | LRS3 mixed with VGGSound noises (test) | PESQ | 3.25 | 18 |
| Speech Enhancement | LRS3 mixed with QUT city-street noises (test) | PESQ | 3.21 | 18 |
| Speech Enhancement | LRS2 mixed with VGGSound noises (test) | PESQ | 3.22 | 18 |
| Audio-visual speech separation | LRS2 (test) | SDRi | 11.28 | 14 |
| Speech Separation | VoxCeleb2-2Mix (test) | SDRi | 8.9 | 12 |
| Speaker Separation | LRS2 synthetic (test) | SDR | 9.25 | 7 |
| Speaker Separation | LRS3 synthetic (test) | SDR | 10.15 | 7 |
| Speech Enhancement | VoxCeleb2 (3 simultaneous speakers) | PESQ | 2.59 | 6 |
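
The table reports SDR (signal-to-distortion ratio, in dB), SDRi (SDR improvement over the unprocessed mixture), and PESQ (perceptual evaluation of speech quality); higher is better for all three. As a quick illustration, a simplified SDR/SDRi computation is sketched below; it omits the reference projection used by the standard BSS Eval toolkit, so treat it as an approximation rather than the official metric.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Simplified signal-to-distortion ratio in dB (no BSS Eval projection)."""
    noise = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def sdri(reference: np.ndarray, estimate: np.ndarray, mixture: np.ndarray) -> float:
    """SDR improvement: how much the estimate gains over the raw mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)

# Toy example: a target plus one interfering signal
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = target + interferer
estimate = 0.9 * target + 0.1 * interferer  # a partially cleaned-up output
print(sdri(target, estimate, mixture))      # large positive SDRi (~17 dB)
```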