AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
About
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized target representations, an approach that has been successful in the uni-modal case. The model uses a shared Transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
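To make the pre-training idea concrete, the following is a minimal sketch in PyTorch of data2vec-style masked prediction with a shared audio-visual encoder: an EMA teacher encodes the unmasked audio-visual input to produce contextualized targets, and the student predicts those targets from a masked input using the same encoder architecture. All module names, dimensions, the sum-based fusion, and the hyper-parameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical, minimal illustration of data2vec-style audio-visual pre-training.
# Not the authors' code; names, dimensions, and fusion are placeholder choices.
import copy
import torch
import torch.nn as nn


class SharedAVEncoder(nn.Module):
    """Modality-specific projections followed by a shared Transformer encoder."""

    def __init__(self, dim=256, n_layers=4, n_heads=4,
                 audio_feat_dim=80, video_feat_dim=512):
        super().__init__()
        # Light-weight front-ends for illustration (real systems use conv/ResNet stems).
        self.audio_proj = nn.Linear(audio_feat_dim, dim)
        self.video_proj = nn.Linear(video_feat_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, audio, video, return_layers=False):
        # Fuse frame-aligned modalities by summation (one simple choice among several).
        feats = self.audio_proj(audio) + self.video_proj(video)
        if not return_layers:
            return self.encoder(feats)
        # Collect per-layer outputs so the teacher can average the top-K layers.
        layer_outputs, x = [], feats
        for blk in self.encoder.layers:
            x = blk(x)
            layer_outputs.append(x)
        return layer_outputs


def ema_update(teacher, student, decay=0.999):
    """Exponential moving average of student weights into the teacher."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)


def training_step(student, teacher, audio, video, mask_prob=0.5, top_k=2):
    """One masked-prediction step: the teacher builds contextualized targets from
    the full audio-visual input; the student regresses them from masked input."""
    B, T, _ = audio.shape
    mask = torch.rand(B, T) < mask_prob  # time-step mask shared across modalities

    # Teacher targets: average of the top-K layer outputs on the unmasked input.
    with torch.no_grad():
        layers = teacher(audio, video, return_layers=True)
        target = torch.stack(layers[-top_k:]).mean(dim=0)

    # Student prediction from masked inputs (masked frames simply zeroed here).
    masked_audio = audio.masked_fill(mask.unsqueeze(-1), 0.0)
    masked_video = video.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = student(masked_audio, masked_video)

    # Regress the teacher targets at the masked positions only.
    return nn.functional.mse_loss(pred[mask], target[mask])


# Toy usage: random tensors stand in for log-mel frames and lip-ROI embeddings.
student = SharedAVEncoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

audio = torch.randn(2, 50, 80)   # (batch, frames, audio features)
video = torch.randn(2, 50, 512)  # (batch, frames, video features)
loss = training_step(student, teacher, audio, video)
loss.backward()
ema_update(teacher, student)
```

After pre-training, such an encoder can be fine-tuned for speech recognition from audio only, video only, or both modalities together.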
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Speech Recognition | LRS3 (test) | WER | 2.7 | 159 |
| Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER | 0.013 | 80 |
| Audio-Visual Speech Recognition | LRS3 clean (test) | WER | 2.5 | 70 |
| Automatic Speech Recognition | LRS3 (test) | WER (%) | 2.7 | 46 |
| Visual Speech Recognition | LRS3 Low-Resource, 30h labelled v1 (test) | WER | 0.027 | 34 |
| Audio-Visual Speech Recognition | LRS-3 Babble noise at 0dB SNR (test) | WER | 6.7 | 32 |
| Automatic Speech Recognition | LRS3 433-hour labeled (test) | WER (%) | 1.4 | 19 |
| Speech Recognition | LRS3 low-resource | WER (V) | 30.8 | 18 |
| Speech Recognition | LRS3 high-resource | WER (V) | 28.5 | 18 |
| Automatic Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER | 0.013 | 16 |