
Scalable Frameworks for Real-World Audio-Visual Speech Recognition

About

The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, which are characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, one that achieves robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. At the architecture level, we explore how to expand model capacity efficiently while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that allocates computational resources based on the input characteristics. Finally, at the system level, we present methods to expand the system's functionality through modular integration with large-scale foundation models, leveraging their cognitive and generative capabilities to maximize final recognition accuracy. By providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.

Sungnyun Kim • 2025

Related benchmarks

Task | Dataset | Result | Rank
Audio-visual speech-to-text translation | MuAViC (test) | BLEU (EL->EN): 11.4 | 23
Audio-Visual Speech Recognition | LRS3 (test) | -- | 18
Audio-Visual Speech Recognition | LRS2 50% visual occlusion (test) | WER (Overall): 13.2 | 10
Audio-Visual Speech Recognition | LRS3 (test) | Overall Score: 10.9 | 10
Speech Recognition | MuAViC (test) | Arabic Score: 92.9 | 9
Audio-Visual Speech Recognition | LRS3 noisy | Average Error Rate: 4.2 | 8
Audio-Visual Speech Recognition | LRS3 Object occlusion and noise | WER (Babble, -10 dB): 25.8 | 7
Audio-Visual Speech Recognition | LRS3 Occlusion by hands | WER (Babble, -10 dB): 26.6 | 7
Audio-Visual Speech Recognition | LRS3 Pixelated face | WER (Babble, -10 dB): 26 | 7
Audio-Visual Speech Recognition | LRS3 + DEMAND Object Occlusion + Noise (test) | Error Rate (PARK): 2.8 | 5
