Time Domain Audio Visual Speech Separation

About

Audio-visual multi-modal modeling has been shown to be effective in many speech-related tasks, such as speech recognition and speech enhancement. This paper introduces a new time-domain audio-visual architecture for target speaker extraction from monaural mixtures. The architecture generalizes the earlier TasNet (time-domain speech separation network) to enable multi-modal learning, and at the same time extends classical audio-visual speech separation from the frequency domain to the time domain. The main components of the proposed architecture are an audio encoder, a video encoder that extracts lip embeddings from video streams, a multi-modal separation network, and an audio decoder. Experiments on simulated mixtures based on the recently released LRS2 dataset show that the method yields Si-SNR improvements of more than 3 dB and 4 dB in the two- and three-speaker cases, respectively, compared to audio-only TasNet and frequency-domain audio-visual networks.
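The abstract reports its gains as Si-SNR (scale-invariant signal-to-noise ratio) improvements. As a rough illustration of how such a figure is commonly computed, here is a minimal sketch in dependency-free Python. It follows the standard Si-SNR definition (zero-mean both signals, project the estimate onto the reference, compare the energies of the projection and the residual); it is not taken from the paper's code. The function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the toy sinusoid signals are all assumptions made for this example.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Illustrative helper (hypothetical name), using the standard definition.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean from both signals.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Everything left over is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, reference):
    """Si-SNR of the estimate minus Si-SNR of the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy signals (assumed for illustration): two sinusoids stand in for speakers.
N = 800
reference = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
interferer = [math.sin(2 * math.pi * 13 * t / N) for t in range(N)]
mixture = [r + i for r, i in zip(reference, interferer)]
# A hypothetical separator that attenuates the interferer by a factor of 10.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement = si_snr_improvement(estimate, mixture, reference)
print(f"Si-SNR improvement: {improvement:.2f} dB")
```

In this toy setup the interferer's amplitude is cut by a factor of ten, so the improvement should come out close to 20 dB. Because the estimate is projected onto the reference, rescaling the estimate leaves the score essentially unchanged, which is why the metric suits time-domain models whose output gain is arbitrary.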

Jian Wu, Yong Xu, Shi-Xiong Zhang, Lian-Wu Chen, Meng Yu, Lei Xie, Dong Yu • 2019

Related benchmarks

Task	Dataset	Result
Audio-visual speech separation	LRS2-2Mix (test)	SI-SNRi 12.5	33
Audio-visual speech separation	LRS3 (test)	SDRi 11.7	29
Audio-visual speech separation	LRS2 (test)	SDRi 12.8	23
Audio-Visual Target Speaker Extraction	LRS2 2-mix (test)	DNSMOS 2.44	22
Automatic Speech Recognition	LRS2-2Mix (test)	WER 31.43	18
Audio-visual speech separation	LRS2	Parameters (M) 13.7	18
Audio-visual speech separation	VoxCeleb2 (test)	SI-SNRi 9.2	16
Speech Separation	VoxCeleb2-2Mix (test)	SDRi 9.8	12
Speech Separation	LRS3-2Mix (test)	SDRi 11.7	11
Audio-visual speech separation	LRS2-3Mix (test)	SI-SNRi 10	8

Showing 10 of 17 rows
