BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition

About

Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.

Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Speech Recognition | LRS3 (test) | WER 20 | 159 |
| Automatic Speech Recognition | Librispeech (test-clean) | WER 38.4 | 84 |
| Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER 0.266 | 80 |
| Audio-Visual Speech Recognition | LRS3 clean (test) | WER 1.1 | 70 |
| Visual Speech Recognition | LRS3 | WER 0.201 | 59 |
| Automatic Speech Recognition | LRS3 (test) | WER (%) 1.1 | 46 |
| Visual Speech Recognition | LRS3 Low-Resource 30h labelled v1 (test) | WER 0.04 | 34 |
| Speech Recognition | LRS3 low-resource | WER (V) 30.8 | 18 |
| Speech Recognition | LRS3 high-resource | WER (V) 26.6 | 18 |
| Audio-Visual Speech Recognition | LRS3 (test) | -- | 18 |
Showing 10 of 25 rows
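
The results above appear to report WER on mixed scales: some rows look like percentages (for example 20 and 26.6) and others like fractions (for example 0.201 and 0.266). As a reading aid, here is a minimal Python sketch of how word error rate is commonly computed, as the word-level edit distance divided by the number of reference words. The function names `word_error_rate` and `as_percent` are illustrative and not taken from the paper or this page, and the paper's own evaluation may normalise text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Illustrative helper only; tokenisation here is plain whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def as_percent(wer_fraction: float) -> float:
    """Convert a fractional WER (e.g. 0.2) to a percentage (e.g. 20.0)."""
    return 100.0 * wer_fraction
```

Under this definition, one substituted word in a five-word reference gives a fractional WER of 0.2, which is 20% on the percentage scale. WER can also exceed 100% when the hypothesis contains many inserted words.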
