Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

About

Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to the task-specific metric (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noises.
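The abstract ties the reward to word error rate (WER) but does not give the formula. As a rough illustration only, the sketch below computes WER as word-level edit distance divided by reference length, and defines a hypothetical reward as the negative WER. The function names `word_error_rate` and `wer_reward`, and the choice of negative WER as the reward, are assumptions for illustration, not the paper's actual definition.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward (assumption): negative WER, so fewer errors score higher."""
    return -word_error_rate(reference, hypothesis)
```

Under this assumed reward, a decoded hypothesis that matches the reference scores 0.0 and every word error pushes the reward further below zero, which is one simple way to make the training signal track the evaluation metric directly.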

Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng • 2022

Related benchmarks

Task	Dataset	Result
Audio-Visual Speech Recognition	LRS3 clean (test)	WER 1.33	77
Audio-Visual Speech Recognition	LRS-3 Babble noise at 0dB SNR (test)	WER 4.5	32
Audio-Visual Speech Recognition	LRS-3 Babble noise at -10dB SNR (test)	WER 22.3	5
Audio-Visual Speech Recognition	LRS-3 Babble noise at -5dB SNR (test)	WER 11.3	5
Audio-Visual Speech Recognition	LRS-3 Average Babble of noisy SNR levels (-10 to 5dB) (test)	WER 10.1	5
