
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

About

Although speech is a simple and effective way for humans to communicate with the outside world, realistic speech interaction also contains multimodal information, e.g., vision and text. How to design a unified framework to integrate different modalities and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR). Results show that the proposed VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
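The core pre-training objective described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the linear stand-ins for the modality-dependent front-end and the shared backbone, and the function names (`modality_module`, `unified_backbone`, `masked_prediction_loss`) are all hypothetical; it only shows the shape of the objective, i.e., cross-entropy over unified tokens at masked positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
T, F, D, V = 8, 24, 16, 100  # frames, input dim, hidden dim, unified-token vocab

def modality_module(x, proj):
    # Stand-in for a modality-dependent front-end: project raw features
    # (visual, audio, or text) into the shared hidden space.
    return x @ proj

def unified_backbone(h, w):
    # Stand-in for the shared, modality-independent backbone
    # (a Transformer in the paper; a single nonlinear layer here).
    return np.tanh(h @ w)

def masked_prediction_loss(features, targets, mask, proj, w, head):
    # 1) Project inputs, then zero out the masked frames.
    h = modality_module(features, proj)
    h[mask] = 0.0
    # 2) Encode with the shared backbone.
    h = unified_backbone(h, w)
    # 3) Predict a unified token at every position.
    logits = h @ head                                    # (T, V)
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    # 4) Cross-entropy only over the masked positions.
    return -logp[mask, targets[mask]].mean()

features = rng.standard_normal((T, F))   # e.g. one modality's frame features
targets = rng.integers(0, V, size=T)     # unified tokens from the tokenizer
mask = rng.random(T) < 0.5               # random frame mask
proj = rng.standard_normal((F, D)) * 0.1
w = rng.standard_normal((D, D)) * 0.1
head = rng.standard_normal((D, V)) * 0.1

loss = masked_prediction_loss(features, targets, mask, proj, w, head)
print(float(loss))
```

Because visual, audio, and text inputs are all mapped to the same token vocabulary by the unified tokenizer, the same loss can be applied to each modality (and to paired data), which is what pulls the three modalities into one shared semantic space.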

Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Speech Recognition | LRS3 (test) | WER | 2.7 | 159 |
| Visual Speech Recognition | LRS3 High-Resource, 433h labelled v1 (test) | WER | 0.012 | 80 |
| Visual Speech Recognition | LRS3 | WER | 0.262 | 59 |
| Visual Speech Recognition | LRS2 | Mean WER | 24.3 | 45 |
| Visual Speech Recognition | LRS3 Low-Resource, 30h labelled v1 (test) | WER | 0.027 | 34 |
| Speech Recognition | LRS3 high-resource | WER (V) | 28.4 | 18 |
