VisG AV-HuBERT: Viseme-Guided AV-HuBERT

About

Current Audio-Visual Speech Recognition (AVSR) systems integrate Large Language Model (LLM) decoders with transformer-based encoders and achieve state-of-the-art results. However, the relative contributions of improved language modelling and enhanced audiovisual encoding remain unclear. We propose Viseme-Guided AV-HuBERT (VisG AV-HuBERT), a multi-task fine-tuning framework that incorporates auxiliary viseme classification to strengthen the model's reliance on visual articulatory features. By extending AV-HuBERT with a lightweight viseme prediction sub-network, the method explicitly guides the encoder to preserve visual speech information. Evaluated on LRS3, VisG AV-HuBERT achieves comparable or improved performance relative to the baseline AV-HuBERT, with notable gains under heavy noise. Under speech noise at -10 dB Signal-to-Noise Ratio (SNR), WER falls from 13.59% to 6.60% (a 51.4% relative improvement). Deeper analysis reveals substantial reductions in substitution errors across noise types, indicating improved speech unit discrimination. Evaluation on LRS2 confirms that the approach generalizes. Our results demonstrate that explicit viseme modelling enhances encoder representations and provides a foundation for noise-robust AVSR through encoder-level improvements.
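The relative improvement quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_reduction` is an assumption introduced here and is not taken from the paper or its code; the two WER values are the ones stated in the abstract.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer versus baseline_wer.

    Hypothetical helper for illustration; not part of the authors' code.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER values quoted in the abstract: LRS3, speech noise, -10 dB SNR.
improvement = relative_reduction(13.59, 6.60)
print(f"{improvement:.1f}% relative improvement")
```

The relative figure is the absolute drop (6.99 percentage points) divided by the baseline WER, which is why it is much larger than the absolute difference.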

Aristeidis Papadopoulos, Rishabh Jain, Naomi Harte • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio-Visual Speech Recognition | LRS3 (test) | -- | 77 |
| Audio-Visual Speech Recognition | LRS2 (clean) | WER 9.925 | 16 |
| Audio-Visual Speech Recognition | LRS3 (clean) | WER 1.38 | 4 |
| Audio-Visual Speech Recognition | LRS3 Speech noise | CER (-10 dB SNR) 5.1 | 4 |
| Audio-Visual Speech Recognition | LRS3 Music noise | CER (-10 dB) 7.11 | 4 |
| Audio-Visual Speech Recognition | LRS3 Random noise | CER (-10 dB) 6.49 | 4 |
| Audio-Visual Speech Recognition | LRS2 Babble noise | WER (-10 dB) 41.65 | 4 |
| Audio-Visual Speech Recognition | LRS2 Speech noise | WER (-10 dB) 24.13 | 4 |
| Audio-Visual Speech Recognition | LRS2 Music noise | WER (-10 dB SNR) 22.76 | 4 |
| Audio-Visual Speech Recognition | LRS2 Random noise | WER (-10 dB) 22.21 | 4 |
Showing 10 of 15 rows
