Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
About
Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations tend to over-rely on the audio modality, which is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To address this, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which the agent dynamically harmonizes modality-invariant and modality-specific representations during auto-regressive decoding. We customize a reward function directly tied to the task-specific metric (i.e., word error rate), which encourages MSRL to explore the optimal integration strategy effectively. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that MSRL generalizes better than other baselines when the test set contains unseen noises.
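To make the reward idea concrete, here is a minimal sketch of a WER-based reward. The abstract only states that the reward is directly related to word error rate; the exact form used by MSRL is not given here, so the negative-WER reward below and the names `word_edit_distance`, `wer`, and `wer_reward` are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


def wer_reward(reference: str, hypothesis: str) -> float:
    """Hypothetical reward: negative WER, so fewer errors means higher reward."""
    # Assumption: the paper's actual reward may be shaped differently.
    return -wer(reference, hypothesis)
```

Under this assumption, a decoded hypothesis that matches the reference exactly receives the maximum reward of 0, and each word error lowers the reward in proportion to the reference length.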
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio-Visual Speech Recognition | LRS3 clean (test) | WER 1.33 | 70 |
| Audio-Visual Speech Recognition | LRS-3 Babble noise at 0dB SNR (test) | WER 4.5 | 32 |
| Audio-Visual Speech Recognition | LRS-3 Babble noise at -10dB SNR (test) | WER 22.3 | 5 |
| Audio-Visual Speech Recognition | LRS-3 Babble noise at -5dB SNR (test) | WER 11.3 | 5 |
| Audio-Visual Speech Recognition | LRS-3 Average Babble of noisy SNR levels (-10 to 5dB) (test) | WER 10.1 | 5 |