
LiRA: Learning Visual Speech Representations from Audio through Self-supervision

About

The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.
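To make the pretext task concrete, here is a minimal sketch of the idea described above: a frame-wise visual encoder followed by a Conformer, trained to regress per-frame acoustic features from silent lip crops. This is not the authors' released implementation; it assumes PyTorch, torchvision, and torchaudio, uses a plain 2D ResNet-18 frontend (the paper uses a 3D-conv + ResNet frontend) and random tensors as stand-ins for the acoustic targets (the paper regresses PASE+ features), and all dimensions and hyperparameters are illustrative.

```
import torch
import torch.nn as nn
import torchaudio
import torchvision.models as tvm

class LiRASketch(nn.Module):
    """Visual encoder trained to regress acoustic features from lip crops."""
    def __init__(self, feat_dim=80, d_model=256):
        super().__init__()
        # Frame-wise visual frontend: 2D ResNet-18 on grayscale mouth crops
        # (a simplification of the paper's 3D-conv + ResNet-18 frontend).
        resnet = tvm.resnet18(weights=None)
        resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                 padding=3, bias=False)
        resnet.fc = nn.Identity()  # expose the 512-d pooled features
        self.frontend = resnet
        self.proj = nn.Linear(512, d_model)
        # Temporal model over the frame sequence.
        self.conformer = torchaudio.models.Conformer(
            input_dim=d_model, num_heads=4, ffn_dim=1024,
            num_layers=6, depthwise_conv_kernel_size=31)
        # Regression head onto per-frame acoustic targets.
        self.head = nn.Linear(d_model, feat_dim)

    def forward(self, lips, lengths):
        # lips: (batch, time, 1, H, W); lengths: valid frames per clip.
        b, t = lips.shape[:2]
        x = self.frontend(lips.flatten(0, 1)).view(b, t, -1)
        x, lengths = self.conformer(self.proj(x), lengths)
        return self.head(x)

# One hypothetical pre-training step: L1 regression against time-aligned
# acoustic features (random tensors stand in for real data here).
model = LiRASketch()
lips = torch.randn(2, 25, 1, 88, 88)   # 1 s of 25 fps mouth crops
targets = torch.randn(2, 25, 80)       # per-frame acoustic features
lengths = torch.full((2,), 25)
loss = nn.functional.l1_loss(model(lips, lengths), targets)
loss.backward()
```

After pre-training, the visual encoder is reused for supervised lip-reading, either as a frozen feature extractor or as an initialisation for fine-tuning on labelled data, which is how the benchmark results below are obtained.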

Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic • 2021

Related benchmarks

Task                             Dataset                      Metric           Result (%)   Rank
Visual Speech Recognition        LRS3 (test)                  WER              49.6         159
Visual-only Speech Recognition   LRS2 (test)                  WER              38.8         63
Visual Speech Recognition        LRS2                         Mean WER         38.8         45
Lip-reading                      LRW 1.0 (test)               Top-1 Accuracy   88.1         37
Audio-Visual Speech Recognition  LRS2 (test)                  WER              3.7          34
Lip-reading                      LRS2 (test)                  WER              39.1         28
Visual Speech Recognition        LRS3 high-resource (test)    WER              49.6         16
Lip-reading                      LRS3 (test)                  WER              43.3         8
