
End-to-end Audio-visual Speech Recognition with Conformers

About

In this work, we present a hybrid CTC/attention model based on a ResNet-18 and a convolution-augmented transformer (Conformer) that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw audio waveforms and pixels, respectively; these features are fed to Conformers, and fusion then takes place via a multi-layer perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training (instead of using pre-computed visual features, as is common in the literature), the use of a Conformer instead of a recurrent network, and the use of a transformer-based language model significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.

Pingchuan Ma, Stavros Petridis, Maja Pantic• 2021
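The abstract describes per-stream encoders whose outputs are fused by an MLP before decoding with a hybrid CTC/attention objective. The sketch below illustrates only the fusion step in NumPy; the feature dimensions, layer sizes, and the loss-weight comment are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def fuse(audio_feats, visual_feats, w1, b1, w2, b2):
    """Concatenate per-frame audio and visual encoder outputs, then
    project through a two-layer MLP (the fusion module described above)."""
    x = np.concatenate([audio_feats, visual_feats], axis=-1)  # (T, d_a + d_v)
    return relu(x @ w1 + b1) @ w2 + b2                        # (T, d_model)

# Hypothetical sizes: 256-dim audio and visual streams, 512-dim fused output.
T, d_a, d_v, d_model = 10, 256, 256, 512
w1 = rng.normal(size=(d_a + d_v, d_model)) * 0.01
b1 = np.zeros(d_model)
w2 = rng.normal(size=(d_model, d_model)) * 0.01
b2 = np.zeros(d_model)

fused = fuse(rng.normal(size=(T, d_a)), rng.normal(size=(T, d_v)),
             w1, b1, w2, b2)
assert fused.shape == (T, d_model)

# Training would then combine the two objectives on the decoder outputs,
# roughly: loss = alpha * ctc_loss + (1 - alpha) * attention_loss,
# where the relative weight alpha is a tunable hyperparameter (assumed here).
```

In the full model the fused sequence would be consumed by the attention decoder and the CTC output layer jointly; this fragment only shows where the two modality streams meet.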

Related benchmarks

Task                             Dataset            Metric    Result  Rank
Visual Speech Recognition        LRS3 (test)        WER       2.3     209
Audio-Visual Speech Recognition  LRS3 clean (test)  WER       2.3     77
Audio-Visual Speech Recognition  LRS3 (test)        WER       2.3     77
Visual-only Speech Recognition   LRS2 (test)        WER       37.9    63
Visual Speech Recognition        LRS3               WER       0.469   63
Automatic Speech Recognition     LRS3 (test)        WER (%)   2.3     58
Speech Recognition               LRS2 (test)        WER       3.9     49
Visual Speech Recognition        LRS2               Mean WER  39.1    49
Audio-Visual Speech Recognition  LRS2 (test)        WER       3.7     34
Lip-reading                      LRS2 (test)        WER       37.9    28
Showing 10 of 31 rows
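The benchmark results above are reported as word error rate (WER): the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal, self-contained implementation using Levenshtein distance (not the evaluation code used for these leaderboards) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)

# One inserted word against a 3-word reference gives WER = 1/3.
print(wer("the cat sat", "the cat sat down"))
```

A WER of 2.3 in the table above therefore means 2.3 word errors per 100 reference words.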
