Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

About

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
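The headline comparisons in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `relative_reduction` is hypothetical and not part of the AV-HuBERT code base, and the numbers are copied from the abstract above.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the relative reduction of `new` with respect to `baseline`, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Figures quoted in the abstract (WER in percent, labeled data in hours).
audio_only = relative_reduction(baseline=2.3, new=1.3)     # audio-only ASR on LRS3
lip_reading = relative_reduction(baseline=33.6, new=32.5)  # lip-reading, 30 h labeled
data_ratio = 31_000 / 30                                   # prior work vs. AV-HuBERT labels

print(f"audio-only relative WER reduction: {audio_only:.1%}")
print(f"lip-reading relative WER reduction: {lip_reading:.1%}")
print(f"labeled-data ratio: {data_ratio:.0f}x")
```

With these inputs the audio-only reduction comes out at about 43.5%, which the abstract rounds to 40%, and the labeled-data ratio is about 1033, matching the "thousand times more" claim.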

Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed • 2022

Related benchmarks

Task | Dataset | Result | Rank
Visual Speech Recognition | LRS3 (test) | WER 3.3 | 159
Automatic Speech Recognition | Librispeech (test-clean) | WER 29.1 | 84
Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER 0.014 | 80
Audio-Visual Speech Recognition | LRS3 clean (test) | WER 26.9 | 70
Visual Speech Recognition | LRS3 | WER 0.269 | 59
Automatic Speech Recognition | LRS3 (test) | WER (%) 1.3 | 46
Emotion Recognition | IEMOCAP 4-class (test) | WAR 46.45 | 46
Visual Speech Recognition | LRS2 | Mean WER 25.5 | 45
Visual Speech Recognition | LRS3 Low-Resource 30h labelled v1 (test) | WER 0.033 | 34
Audio-Visual Speech Recognition | LRS2 (test) | WER 3.1 | 34
Showing 10 of 38 rows
