Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

About

Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at https://github.com/ahaliassos/usr.
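The abstract does not describe how the greedy pseudo-labelling works. One common reading of "greedy" in speech recognition is greedy (argmax) decoding of a model's per-frame CTC outputs: take the best token at each frame, collapse repeats, and drop blanks. The sketch below illustrates that generic procedure only. The function name `greedy_ctc_labels`, the `blank` index, and the toy scores are assumptions made for illustration, not details taken from the paper or its repository.

```python
from typing import List, Sequence


def greedy_ctc_labels(frame_scores: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks.

    frame_scores[t][k] is the score of token k at frame t (hypothetical
    teacher outputs; any monotone score such as a log-probability works).
    """
    # Pick the highest-scoring token index for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]

    labels: List[int] = []
    previous = None
    for token in best_path:
        # Keep a token only when it differs from the previous frame and is not blank.
        if token != previous and token != blank:
            labels.append(token)
        previous = token
    return labels


# Toy example with three tokens (0 = blank) over five frames.
toy_scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
pseudo_label = greedy_ctc_labels(toy_scores)
```

In a semi-supervised setup, a sequence decoded this way from an unlabelled sample could serve as its training target. Whether the authors filter, weight, or combine such targets is not stated in the abstract.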

Alexandros Haliassos, Rodrigo Mira, Honglie Chen, Zoe Landgraf, Stavros Petridis, Maja Pantic • 2024

Related benchmarks

Task                             Dataset                                      Result       Rank
Visual Speech Recognition        LRS3 (test)                                  WER 21.5     209
Automatic Speech Recognition     Librispeech (test-clean)                     WER 25.3     84
Visual Speech Recognition        LRS3 High-Resource, 433h labelled v1 (test)  WER 0.011    80
Audio-Visual Speech Recognition  LRS3 (test)                                  WER 1.1      77
Visual Speech Recognition        LRS3                                         WER 0.011    63
Visual-only Speech Recognition   LRS2 (test)                                  WER 15.4     63
Automatic Speech Recognition     LRS3 (test)                                  WER (%) 1.2  58
Speech Recognition               LRS2 (test)                                  WER 1.9      49
Visual Speech Recognition        LRS3 Low-Resource 30h labelled v1 (test)     WER 0.024    34
Audio-Visual Speech Recognition  LRS2 (test)                                  WER 1.9      34
Showing 10 of 26 rows
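
Every result above is a word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition using whitespace tokenisation. The function name `word_error_rate` and the example sentences are illustrative assumptions, and published benchmarks typically apply text normalisation that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev_row[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr_row = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr_row[j] = min(
                prev_row[j] + 1,                      # deletion
                curr_row[j - 1] + 1,                  # insertion
                prev_row[j - 1] + substitution_cost,  # match or substitution
            )
        prev_row = curr_row
    return prev_row[-1] / len(ref_words)


# One substituted word out of four reference words.
example_wer = word_error_rate("a b c d", "a x c d")
```

The function returns a fraction, so multiply by 100 for a percentage. The table mixes both scales as extracted (for example 0.011 next to 21.5), so check the unit of each row before comparing values.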