
Self-Supervised Audio-Visual Speech Representations Learning By Multimodal Self-Distillation

About

In this work, we present AV2vec, a novel method for learning audio-visual speech representations by multimodal self-distillation. AV2vec consists of a student and a teacher module: the student performs a masked latent feature regression task using multimodal target features generated online by the teacher, whose parameters are a momentum update of the student's. Because the target features are generated online, AV2vec requires no iterative re-labeling step, unlike AV-HuBERT, and its total training time is reduced to less than one fifth. We further propose AV2vec-MLM, which augments AV2vec with a masked language model (MLM)-style loss via multitask learning. Experimental results show that AV2vec achieves performance comparable to the AV-HuBERT baseline; when combined with the MLM-style loss, AV2vec-MLM outperforms the baselines and achieves the best performance on the downstream tasks.
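The two mechanisms named in the abstract can be sketched in a few lines: the teacher's parameters are a momentum (exponential moving average) update of the student's, and the training loss is a regression computed only at masked positions against the teacher's online targets. The sketch below is illustrative pseudocode-in-Python, not the paper's implementation; the function names (`ema_update`, `masked_regression_loss`) and the momentum value `tau` are assumptions for clarity.

```python
def ema_update(teacher_params, student_params, tau=0.999):
    """Momentum (EMA) teacher update: teacher <- tau * teacher + (1 - tau) * student.

    Parameters are represented here as flat lists of floats for simplicity;
    in practice each entry would be a tensor updated elementwise.
    """
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]


def masked_regression_loss(student_pred, teacher_target, mask):
    """Mean squared error computed only at masked positions.

    Only frames where mask is True (i.e. the input was masked for the
    student) contribute to the loss, matching the masked latent feature
    regression task described above.
    """
    terms = [(p - t) ** 2
             for p, t, m in zip(student_pred, teacher_target, mask) if m]
    return sum(terms) / max(len(terms), 1)


# Toy usage with scalar "parameters" and a 3-frame sequence.
teacher = ema_update([0.0], [1.0], tau=0.9)          # -> [0.1]
loss = masked_regression_loss(
    student_pred=[1.0, 2.0, 3.0],
    teacher_target=[1.0, 0.0, 5.0],
    mask=[False, True, True])                        # -> 4.0
```

Because the targets come from the momentum teacher during the same forward pass, there is no separate clustering or pseudo-labeling stage between training iterations, which is where the claimed training-time reduction comes from.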

Jing-Xuan Zhang, Genshun Wan, Zhen-Hua Ling, Jia Pan, Jianqing Gao, Cong Liu • 2022

Related benchmarks

Task                             | Dataset                                     | Result       | Rank
Visual Speech Recognition        | LRS3 (test)                                 | WER 5.4      | 159
Visual Speech Recognition        | LRS3 High-Resource, 433h labelled v1 (test) | WER 0.025    | 80
Audio-Visual Speech Recognition  | LRS3 clean (test)                           | WER 2.5      | 70
Automatic Speech Recognition     | LRS3 (test)                                 | WER (%) 5.6  | 46
Audio-Visual Speech Recognition  | LRS-3 Babble noise at 0dB SNR (test)        | WER 6.7      | 32
English Transcription            | LRS3 Noisy 0-SNR (test)                     | WER 0.067    | 25
Automatic Speech Recognition     | LRS3 Clean original (test)                  | WER 2.7      | 21
Automatic Speech Recognition     | LRS3 433-hour labeled (test)                | WER (%) 2.7  | 19
