# u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

## About
While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the scarcity of labeled and unlabeled audio-visual data and the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input. Code and models are available at https://github.com/facebookresearch/av_hubert
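The modality dropout idea described above can be sketched as follows. This is a minimal illustration, not u-HuBERT's actual implementation: the function name, dropout probabilities, and zero-filling strategy are assumptions made for clarity. The key point is that when both streams are present, one of them is sometimes suppressed during pre-training, so the fused representation cannot come to depend on any single modality.

```python
import random
import numpy as np

def modality_dropout(audio_feats, video_feats, p_drop=0.5, p_audio=0.5):
    """Illustrative modality dropout (hypothetical helper, not u-HuBERT code).

    audio_feats, video_feats: (batch, time, dim) arrays, or None for
    unimodal inputs. With probability p_drop, one modality is dropped;
    the dropped stream is zero-filled so the fusion layer always sees
    a fixed input shape.
    """
    if audio_feats is not None and video_feats is not None and random.random() < p_drop:
        if random.random() < p_audio:
            audio_feats = np.zeros_like(audio_feats)  # keep video only
        else:
            video_feats = np.zeros_like(video_feats)  # keep audio only
    return audio_feats, video_feats

# During pre-training, the (possibly dropped-out) features would then be
# fused and fed to a Transformer trained with masked cluster prediction.
audio = np.ones((2, 10, 4))
video = np.ones((2, 10, 4))
a_out, v_out = modality_dropout(audio, video, p_drop=1.0, p_audio=1.0)
```

Because unimodal inputs (one stream set to `None`) pass through unchanged, the same objective covers multimodal and unimodal speech, which is what enables the zero-shot modality transfer reported below.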
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER 0.012 | 80 |
| Audio-Visual Speech Recognition | LRS3 clean (test) | WER 1.2 | 70 |
| Automatic Speech Recognition | LRS3 (test) | -- | 46 |
| Audio-Visual Speech Recognition | LRS-3, babble noise at 0 dB SNR (test) | WER 4.6 | 32 |
| English Transcription | LRS3 Noisy 0-SNR (test) | WER 0.046 | 25 |
| Speech Recognition | LRS3-TED | WER 27.2 | 25 |
| Audio-Visual Speech-to-Text Translation | MuAViC (test) | BLEU (EL->EN) 14.5 | 23 |
| Automatic Speech Recognition | LRS3 Clean original (test) | WER 1.4 | 21 |
| Automatic Speech Recognition | LRS3 433-hour labeled (test) | WER (%) 1.4 | 19 |
| Speech Recognition | LRS3 high-resource | WER (V) 29.1 | 18 |