
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

About

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and the cost to deploy one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than the state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input. Code and models are available at https://github.com/facebookresearch/av_hubert
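The modality dropout described above can be illustrated with a minimal sketch: during pre-training, each example randomly keeps both modalities or only one, so the model learns representations that remain usable when a modality is missing at test time. The function name, probabilities, and feature placeholders below are hypothetical, not the paper's actual implementation.

```python
import random

def modality_dropout(audio_feat, video_feat, p_drop=0.5, p_audio=0.5, rng=random):
    """Hypothetical modality-dropout sketch.

    With probability p_drop, keep only one modality (audio with probability
    p_audio, otherwise video); otherwise keep both. A dropped modality is
    returned as None, which a downstream fusion module would treat as absent.
    """
    if rng.random() < p_drop:
        if rng.random() < p_audio:
            return audio_feat, None   # audio-only example
        return None, video_feat       # video-only example
    return audio_feat, video_feat     # audio-visual example
```

Because the same masked cluster prediction loss is applied regardless of which modalities survive, a single model is trained on audio-visual, audio-only, and video-only inputs interchangeably.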

Wei-Ning Hsu, Bowen Shi • 2022

Related benchmarks

Task | Dataset | Result | Rank
Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER 0.012 | 80
Audio-Visual Speech Recognition | LRS3 clean (test) | WER 1.2 | 70
Automatic Speech Recognition | LRS3 (test) | -- | 46
Audio-Visual Speech Recognition | LRS-3 Babble noise at 0dB SNR (test) | WER 4.6 | 32
English Transcription | LRS3 Noisy 0-SNR (test) | WER 0.046 | 25
Speech Recognition | LRS3-TED | WER 27.2 | 25
Audio-visual speech-to-text translation | MuAViC (test) | BLEU (EL->EN) 14.5 | 23
Automatic Speech Recognition | LRS3 Clean original (test) | WER 1.4 | 21
Automatic Speech Recognition | LRS3 433-hour labeled (test) | WER (%) 1.4 | 19
Speech Recognition | LRS3 high-resource | WER (V) 29.1 | 18

Showing 10 of 19 rows

Other info

Code

https://github.com/facebookresearch/av_hubert