Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video

About

Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the public LRS3-TED corpus. Our best video-only model obtains 31.4% WER on YTDEV18 and 17.0% on LRS3-TED, relative improvements of 10% and 15%, respectively, over our convolutional baseline. After fine-tuning, our model achieves state-of-the-art audio-visual recognition performance on LRS3-TED (1.6% WER). In addition, in a series of experiments on multi-person AV-ASR, we obtained an average relative reduction of 2% over our convolutional video frontend.
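The abstract reports results as word error rate (WER) and as relative improvements over a baseline. As a minimal sketch of that arithmetic, the Python below computes WER as word-level edit distance divided by reference length, and a relative improvement between two WER values. The function names are illustrative and not taken from the paper. The baseline value of 20.0 is a hypothetical figure chosen so that the arithmetic yields a 15% relative improvement; the abstract does not state the baseline WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution and one deletion against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical baseline of 20.0% WER (an assumption, not from the source).
    print(relative_improvement(baseline_wer=20.0, new_wer=17.0))
```

A relative improvement is distinct from an absolute one: under the hypothetical baseline above, moving from 20.0% to 17.0% WER is 3 points absolute but 15% relative.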

Dmitriy Serdyuk, Otavio Braga, Olivier Siohan • 2022

Related benchmarks

Task                             Dataset                                      Result     Rank
Visual Speech Recognition        LRS3 (test)                                  WER 1.6    159
Visual Speech Recognition        LRS3 High-Resource, 433h labelled v1 (test)  WER 0.016  80
Audio-Visual Speech Recognition  LRS3 clean (test)                            WER 1.6    70
Visual Speech Recognition        LRS3                                         WER 0.016  59
Audio-Visual Speech Recognition  LRS-3 Babble noise at 0dB SNR (test)         WER 2.9    32
English Transcription            LRS3 Noisy 0-SNR (test)                      WER 0.061  25
Speech Recognition               LRS3-TED                                     WER 17     25
Automatic Speech Recognition     LRS3 Clean original (test)                   WER 1.6    21
Audio-Visual Speech Recognition  LRS3 (test)                                  WER 1.6    18
Audio-Visual Speech Recognition  TED LRS3                                     WER 0.016  10