# Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation

## About
In this work, we present AV2vec, a novel method for learning audio-visual speech representations by multimodal self-distillation. AV2vec comprises a student and a teacher module: the student performs a masked latent feature regression task on multimodal target features generated online by the teacher, whose parameters are a momentum update of the student's. Because the target features are generated online, AV2vec requires no iterative training step as AV-HuBERT does, and total training time is reduced to less than one-fifth. We further propose AV2vec-MLM, which augments AV2vec with a masked language model (MLM)-style loss via multitask learning. Our experiments show that AV2vec achieves performance comparable to the AV-HuBERT baseline, while AV2vec-MLM outperforms the baselines and achieves the best results on the downstream tasks.
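The core training loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the "encoders" are plain linear maps, and names such as `momentum`, `ema_update`, and `masked_regression_loss` are illustrative assumptions rather than identifiers from the AV2vec codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def ema_update(teacher_w, student_w, momentum=0.999):
    """Teacher parameters are a momentum (EMA) update of the student's.
    `momentum` is an assumed hyperparameter, not a value from the paper."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def masked_regression_loss(student_feats, teacher_feats, mask):
    """Student regresses the teacher's online targets at masked positions."""
    diff = (student_feats - teacher_feats)[mask]
    return float(np.mean(diff ** 2))

# Toy fused audio-visual input: T frames of D-dimensional features.
T, D = 8, 4
x = rng.standard_normal((T, D))
student_w = rng.standard_normal((D, D))   # stand-in for the student encoder
teacher_w = student_w.copy()              # teacher initialised from the student

mask = np.zeros(T, dtype=bool)
mask[2:5] = True                          # frames the student must reconstruct

x_masked = x.copy()
x_masked[mask] = 0.0                      # crude masking of the student's input

student_out = x_masked @ student_w        # student sees the masked input
teacher_out = x @ teacher_w               # teacher sees the full input (online targets)

loss = masked_regression_loss(student_out, teacher_out, mask)
teacher_w = ema_update(teacher_w, student_w)
```

Because the teacher is only a moving average of the student, it adds no extra gradient computation, which is what lets the targets be produced online instead of through AV-HuBERT's separate clustering iterations.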
## Related benchmarks
| Task | Dataset | Result (WER) | Rank |
|---|---|---|---|
| Visual Speech Recognition | LRS3 (test) | 5.4% | 159 |
| Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | 2.5% | 80 |
| Audio-Visual Speech Recognition | LRS3 clean (test) | 2.5% | 70 |
| Automatic Speech Recognition | LRS3 (test) | 5.6% | 46 |
| Audio-Visual Speech Recognition | LRS-3 Babble noise at 0dB SNR (test) | 6.7% | 32 |
| English Transcription | LRS3 Noisy 0-SNR (test) | 6.7% | 25 |
| Automatic Speech Recognition | LRS3 Clean original (test) | 2.7% | 21 |
| Automatic Speech Recognition | LRS3 433-hour labeled (test) | 2.7% | 19 |