
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

About

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges by building audio-visual representations through the prediction of contextualized target representations, an approach that has proven successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
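The core of the data2vec-style objective described above is a student network that predicts a teacher's contextualized representations of masked inputs, where the teacher is an exponential moving average (EMA) of the student. Below is a minimal NumPy sketch of that training step; the toy linear "encoder", the additive audio-video fusion, and all shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    # Stand-in for the shared transformer encoder: one linear layer + tanh.
    return np.tanh(x @ W)

T, D = 8, 4                          # time steps, feature dim (toy sizes)
W_student = rng.normal(size=(D, D))
W_teacher = W_student.copy()         # teacher starts as a copy of the student

audio = rng.normal(size=(T, D))      # audio frames
video = rng.normal(size=(T, D))      # lip-video frames
fused = audio + video                # simple additive fusion (assumption)

# Teacher sees the full input; its contextualized outputs are the targets.
targets = encoder(fused, W_teacher)

# Student sees a masked version of the same input.
mask = rng.random(T) < 0.5
if not mask.any():
    mask[0] = True                   # guarantee at least one masked step
student_in = fused.copy()
student_in[mask] = 0.0               # zero out masked time steps
pred = encoder(student_in, W_student)

# Regression loss only on the masked positions (data2vec-style).
loss = np.mean((pred[mask] - targets[mask]) ** 2)

# EMA update of the teacher weights; tau is close to 1 in practice.
tau = 0.999
W_teacher = tau * W_teacher + (1 - tau) * W_student
```

Because the teacher attends to the unmasked sequence, the targets are contextualized rather than local features, which is what distinguishes this objective from masked reconstruction of raw inputs.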

Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli • 2023

Related benchmarks

Task                               Dataset                                        Metric   Result  Rank
Visual Speech Recognition          LRS3 (test)                                    WER      2.7     209
Visual Speech Recognition          LRS3 High-Resource, 433h labelled v1 (test)    WER      0.013   80
Audio-Visual Speech Recognition    LRS3 (test)                                    WER      1.3     77
Audio-Visual Speech Recognition    LRS3 clean (test)                              WER      2.5     77
Automatic Speech Recognition       LRS3 (test)                                    WER (%)  2.7     58
Visual Speech Recognition          LRS3 Low-Resource, 30h labelled v1 (test)      WER      0.027   34
Audio-Visual Speech Recognition    LRS-3 Babble noise at 0dB SNR (test)           WER      6.7     32
Visual Speech Recognition          LRS3 30h labeled low-resource (test)           WER      30.8    28
Automatic Speech Recognition       LRS3 30h labeled low-resource (test)           WER      2.7     26
Audio-Visual Speech Recognition    LRS3 30h labeled low-resource (test)           WER      2.7     22

(Showing 10 of 20 rows)
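All results in the table are word error rates (WER); note that some rows appear to report it as a fraction (e.g. 0.013) and others as a percentage (e.g. 2.7). WER is the word-level edit distance between hypothesis and reference, divided by the reference length; a short sketch for computing it (a standard Levenshtein-based formulation, not code from the paper):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution/match
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion over four reference words, giving 0.5 (i.e. 50%).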
