VisG AV-HuBERT: Viseme-Guided AV-HuBERT

About

Recent Audio-Visual Speech Recognition (AVSR) systems integrate Large Language Model (LLM) decoders with transformer-based encoders, achieving state-of-the-art results. However, the relative contributions of improved language modelling versus enhanced audiovisual encoding remain unclear. We propose Viseme-Guided AV-HuBERT (VisG AV-HuBERT), a multi-task fine-tuning framework that incorporates auxiliary viseme classification to strengthen the model's reliance on visual articulatory features. By extending AV-HuBERT with a lightweight viseme prediction sub-network, this method explicitly guides the encoder to preserve visual speech information. Evaluated on LRS3, VisG AV-HuBERT achieves comparable or improved performance over the baseline AV-HuBERT, with notable gains under heavy noise conditions: for Speech noise at -10 dB Signal-to-Noise Ratio (SNR), WER drops from 13.59% to 6.60% (a 51.4% relative improvement). Deeper analysis reveals substantial reductions in substitution errors across noise types, demonstrating improved speech unit discrimination. Evaluation on LRS2 confirms generalization capability. Our results demonstrate that explicit viseme modelling enhances encoder representations and provides a foundation for enhancing noise-robust AVSR through encoder-level improvements.
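The relative improvement quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and the error counts passed to `word_error_rate` are made-up values that just show the standard WER definition (substitutions, deletions and insertions divided by the number of reference words).

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def word_error_rate(substitutions: int, deletions: int,
                    insertions: int, ref_words: int) -> float:
    """Standard WER in percent: (S + D + I) / N, with N reference words."""
    if ref_words <= 0:
        raise ValueError("ref_words must be positive")
    return 100.0 * (substitutions + deletions + insertions) / ref_words


# Figures from the abstract: Speech noise at -10 dB SNR on LRS3.
rel = relative_reduction(13.59, 6.60)
print(f"{rel:.1f}% relative WER reduction")  # prints "51.4% relative WER reduction"

# Hypothetical counts, only to illustrate the definition.
example_wer = word_error_rate(substitutions=3, deletions=1,
                              insertions=1, ref_words=100)
print(f"{example_wer:.1f}% WER")  # prints "5.0% WER"
```

Because substitutions are one of the three terms in the numerator, a drop in substitution errors lowers WER directly, provided deletions and insertions do not rise to offset it.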

Aristeidis Papadopoulos, Rishabh Jain, Naomi Harte • 2026

Related benchmarks

Task	Dataset	Result
Audio-Visual Speech Recognition	LRS3 (test)	--	77
Audio-Visual Speech Recognition	LRS2 (clean)	WER 9.925	16
Audio-Visual Speech Recognition	LRS3 (clean)	WER 1.38	4
Audio-Visual Speech Recognition	LRS3 Speech noise	CER (-10 dB SNR) 5.1	4
Audio-Visual Speech Recognition	LRS3 Music noise	CER (-10 dB) 7.11	4
Audio-Visual Speech Recognition	LRS3 Random noise	CER (-10 dB) 6.49	4
Audio-Visual Speech Recognition	LRS2 Babble noise	WER (-10 dB) 41.65	4
Audio-Visual Speech Recognition	LRS2 Speech noise	WER (-10 dB) 24.13	4
Audio-Visual Speech Recognition	LRS2 Music noise	WER (-10 dB SNR) 22.76	4
Audio-Visual Speech Recognition	LRS2 Random noise	WER (-10 dB) 22.21	4

Showing 10 of 15 rows
