
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

About

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves on the state of the art on the LRS3-TED set.

Takaki Makino, Hank Liao, Yannis Assael, Brendan Shillingford, Basilio Garcia, Otavio Braga, Olivier Siohan (Google Inc.; DeepMind) • 2019

Related benchmarks

Task | Dataset | Result | Rank
Visual Speech Recognition | LRS3 (test) | WER 4.5 | 159
Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER 0.045 | 80
Audio-Visual Speech Recognition | LRS3 clean (test) | WER 4.5 | 70
Visual Speech Recognition | LRS3 | WER 0.045 | 59
Automatic Speech Recognition | LRS3 (test) | -- | 46
Speech Recognition | LRS3-TED | WER 33.6 | 25
Visual Speech Recognition | LRS3 low-resource (test) | WER 33.6 | 20
Lip-reading | LRS3 1.0 (test) | WER 33.6 | 19
Automatic Speech Recognition | LRS3 433-hour labeled (test) | WER (%) 4.8 | 19
Automatic Speech Recognition | LRS3 low-resource (test) | WER 0.048 | 18
Showing 10 of 17 rows
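
The WER results listed above appear on two scales (for example 4.5 and 0.045), which suggests that some entries are reported as percentages and others as fractions. As a minimal sketch, assuming the standard definition of word error rate (word-level edit distance divided by the number of reference words), the metric and the conversion between the two scales could be computed as follows. The function names and the example sentences are illustrative and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def as_percent(wer_fraction: float) -> float:
    """Convert a fractional WER (e.g. 0.045) to a percentage (e.g. 4.5)."""
    return 100.0 * wer_fraction


if __name__ == "__main__":
    # Illustrative sentences only: one substitution and one deletion
    # against a six-word reference.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.3f} (fraction), {as_percent(wer):.1f}%")
```

Under this reading, a listed value of 0.045 and a listed value of 4.5 would describe the same error rate on different scales, so the scale should be checked before comparing rows.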
