AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
About
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized target representations, an approach that has been successful in the uni-modal case. The model uses a shared Transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
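To make the pre-training idea concrete, the following is a minimal sketch in PyTorch of data2vec-style masked prediction with a shared audio-visual encoder: an EMA teacher encodes the unmasked audio-visual input to produce contextualized targets, and the student predicts those targets from a masked input using the same encoder architecture. All module names, dimensions, the sum-based fusion, and the hyper-parameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical, minimal illustration of data2vec-style audio-visual pre-training.
# Not the authors' code; names, dimensions, and fusion are placeholder choices.
import copy
import torch
import torch.nn as nn


class SharedAVEncoder(nn.Module):
    """Modality-specific projections followed by a shared Transformer encoder."""

    def __init__(self, dim=256, n_layers=4, n_heads=4,
                 audio_feat_dim=80, video_feat_dim=512):
        super().__init__()
        # Light-weight front-ends for illustration (real systems use conv/ResNet stems).
        self.audio_proj = nn.Linear(audio_feat_dim, dim)
        self.video_proj = nn.Linear(video_feat_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, audio, video, return_layers=False):
        # Fuse frame-aligned modalities by summation (one simple choice among several).
        feats = self.audio_proj(audio) + self.video_proj(video)
        if not return_layers:
            return self.encoder(feats)
        # Collect per-layer outputs so the teacher can average the top-K layers.
        layer_outputs, x = [], feats
        for blk in self.encoder.layers:
            x = blk(x)
            layer_outputs.append(x)
        return layer_outputs


def ema_update(teacher, student, decay=0.999):
    """Exponential moving average of student weights into the teacher."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)


def training_step(student, teacher, audio, video, mask_prob=0.5, top_k=2):
    """One masked-prediction step: the teacher builds contextualized targets from
    the full audio-visual input; the student regresses them from masked input."""
    B, T, _ = audio.shape
    mask = torch.rand(B, T) < mask_prob  # time-step mask shared across modalities

    # Teacher targets: average of the top-K layer outputs on the unmasked input.
    with torch.no_grad():
        layers = teacher(audio, video, return_layers=True)
        target = torch.stack(layers[-top_k:]).mean(dim=0)

    # Student prediction from masked inputs (masked frames simply zeroed here).
    masked_audio = audio.masked_fill(mask.unsqueeze(-1), 0.0)
    masked_video = video.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = student(masked_audio, masked_video)

    # Regress the teacher targets at the masked positions only.
    return nn.functional.mse_loss(pred[mask], target[mask])


# Toy usage: random tensors stand in for log-mel frames and lip-ROI embeddings.
student = SharedAVEncoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

audio = torch.randn(2, 50, 80)   # (batch, frames, audio features)
video = torch.randn(2, 50, 512)  # (batch, frames, video features)
loss = training_step(student, teacher, audio, video)
loss.backward()
ema_update(teacher, student)
```

After pre-training, such an encoder can be fine-tuned for speech recognition from audio only, video only, or both modalities together.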
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Speech Recognition | LRS3 (test) | WER | 2.7 | 159 |
| Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER | 0.013 | 80 |
| Audio-Visual Speech Recognition | LRS3 clean (test) | WER | 2.5 | 70 |
| Automatic Speech Recognition | LRS3 (test) | WER (%) | 2.7 | 46 |
| Visual Speech Recognition | LRS3 Low-Resource, 30h labelled v1 (test) | WER | 0.027 | 34 |
| Audio-Visual Speech Recognition | LRS-3 Babble noise at 0dB SNR (test) | WER | 6.7 | 32 |
| Automatic Speech Recognition | LRS3 433-hour labeled (test) | WER (%) | 1.4 | 19 |
| Speech Recognition | LRS3 low-resource | WER (V) | 30.8 | 18 |
| Speech Recognition | LRS3 high-resource | WER (V) | 28.5 | 18 |
| Automatic Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER | 0.013 | 16 |