On Robustness to Missing Video for Audiovisual Speech Recognition

About

It has been shown that learning audiovisual features can lead to improved speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications, the visual features are partially or entirely missing, e.g., the speaker might move off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures on a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing techniques for robustness like dropout fall short.
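The cascade idea described above can be sketched as a simple routing rule: fall back to a single-modality audio-only recognizer whenever too much video is missing. All names, thresholds, and model stand-ins below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Hedged sketch of an architecture-agnostic cascade for robustness to
# missing video. The recognizer functions are stand-ins; in practice they
# would wrap trained audio-only and audiovisual models.

from typing import Optional, Sequence

def audio_only_asr(audio: Sequence[float]) -> str:
    """Stand-in for a single-modality audio-only recognizer."""
    return "hypothesis from audio-only model"

def audiovisual_asr(audio: Sequence[float],
                    video: Sequence[Optional[float]]) -> str:
    """Stand-in for an audiovisual recognizer that assumes video is present."""
    return "hypothesis from audiovisual model"

def cascaded_asr(audio: Sequence[float],
                 video: Optional[Sequence[Optional[float]]],
                 min_video_fraction: float = 0.5) -> str:
    """Route to the audio-only model when too much video is missing.

    `min_video_fraction` is a hypothetical threshold: if fewer than this
    fraction of video frames are available, fall back to audio-only so the
    multi-modal system never does worse than its audio-only component.
    """
    if video is None or len(video) == 0:
        return audio_only_asr(audio)
    available = sum(1 for frame in video if frame is not None)
    if available / len(video) < min_video_fraction:
        return audio_only_asr(audio)
    return audiovisual_asr(audio, video)
```

Because the routing decision is made outside either model, this fallback works with any pair of recognizers, which is what makes the cascade architecture-agnostic.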

Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan • 2023

Related benchmarks

Task                            | Dataset                               | Result    | Rank
--------------------------------|---------------------------------------|-----------|-----
Audio-Visual Speech Recognition | LRS3 clean (test)                     | WER 1.9   | 70
Audio-Visual Speech Recognition | LRS-3 Babble noise at 0 dB SNR (test) | WER 1.9   | 32
Automatic Speech Recognition    | LRS3 Clean original (test)            | WER 0.9   | 21
Audio-Visual Speech Recognition | TED LRS3                              | WER 0.009 | 10
Audio-visual diarization        | MEET360                               | WER 24.8  | 3
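The results above are reported as word error rate (WER). As a reminder of what that metric measures, here is a minimal, textbook-style WER computation via word-level Levenshtein distance; this is not code from the paper.

```python
# Hedged sketch: word error rate = (substitutions + deletions + insertions)
# divided by the number of words in the reference transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, against the reference "the cat sat here", the hypothesis "the bat sat" has one substitution and one deletion, giving a WER of 2/4 = 0.5.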
