
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

About

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized target representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods in all settings with the same amount of data and model size.
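The core idea of predicting contextualized target representations, as in data2vec, is that an EMA "teacher" copy of the encoder produces layer-averaged targets from the unmasked input, while the "student" encoder regresses those targets at masked positions. The following is a minimal numpy sketch of that training step under stated simplifications (a toy tanh-layer stack stands in for the shared transformer; `encoder`, `ema_update`, and all shapes are illustrative, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, weights):
    # Toy stand-in for the shared transformer: a stack of linear+tanh
    # layers, returning every layer's hidden states.
    states, h = [], x
    for W in weights:
        h = np.tanh(h @ W)
        states.append(h)
    return states

D, L = 16, 4  # feature dim and number of layers (arbitrary toy sizes)
student_w = [rng.normal(scale=0.1, size=(D, D)) for _ in range(L)]
teacher_w = [W.copy() for W in student_w]  # teacher initialized from student

def ema_update(teacher, student, tau=0.999):
    # Teacher weights track the student via exponential moving average.
    for Wt, Ws in zip(teacher, student):
        Wt *= tau
        Wt += (1.0 - tau) * Ws

def contextualized_targets(x, teacher, k=3):
    # Targets are the average of the top-k teacher layer outputs,
    # computed on the *unmasked* input -- hence "contextualized".
    return np.mean(encoder(x, teacher)[-k:], axis=0)

def masked_prediction_loss(x, mask, student, teacher):
    targets = contextualized_targets(x, teacher)
    x_masked = np.where(mask[:, None], 0.0, x)  # zero out masked frames
    pred = encoder(x_masked, student)[-1]
    # Regress the teacher targets only at masked positions.
    return np.mean((pred[mask] - targets[mask]) ** 2)

T = 20  # frames; in AV-data2vec, audio and video features are fused
x = rng.normal(size=(T, D))  # before entering the shared encoder
mask = rng.random(T) < 0.5
loss = masked_prediction_loss(x, mask, student_w, teacher_w)
ema_update(teacher_w, student_w)
```

In the real model the student's gradient step would follow the loss computation, and the EMA update keeps the teacher a slowly-moving average of the student so the targets stay stable.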

Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Speech Recognition | LRS3 (test) | WER | 2.7 | 159 |
| Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER | 0.013 | 80 |
| Audio-Visual Speech Recognition | LRS3 clean (test) | WER | 2.5 | 70 |
| Automatic Speech Recognition | LRS3 (test) | WER (%) | 2.7 | 46 |
| Visual Speech Recognition | LRS3 Low-Resource, 30h labelled v1 (test) | WER | 0.027 | 34 |
| Audio-Visual Speech Recognition | LRS-3, babble noise at 0 dB SNR (test) | WER | 6.7 | 32 |
| Automatic Speech Recognition | LRS3 433-hour labeled (test) | WER (%) | 1.4 | 19 |
| Speech Recognition | LRS3 low-resource | WER (V) | 30.8 | 18 |
| Speech Recognition | LRS3 high-resource | WER (V) | 28.5 | 18 |
| Automatic Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER | 0.013 | 16 |

Showing 10 of 16 rows.
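All results above are reported as word error rate (WER): the word-level edit distance (substitutions, insertions, deletions) between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch of the metric (the `wer` helper is illustrative, not from the paper or a benchmark toolkit):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One insertion against a 3-word reference -> WER = 1/3.
score = wer("the cat sat", "the cat sat on")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why noisy-condition results (e.g. babble noise at 0 dB SNR) are usually quoted alongside clean-condition ones.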
