Cross-Modal Bottleneck Fusion For Noise Robust Audio-Visual Speech Recognition

About

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual cues to improve speech recognition under noisy conditions. A central question is how to design a fusion mechanism that allows the model to effectively exploit visual information when the audio signal is degraded, while maintaining strong performance on clean speech. We propose CoBRA (Cross-modal Bottleneck for Robust AVSR), a bottleneck-based fusion framework that introduces a compact set of learnable tokens to mediate cross-modal exchange. By regulating information flow through these tokens, the audio stream can reliably access essential visual cues even under adverse or out-of-domain noise. Despite limited training data, our model surpasses comparable baselines and remains competitive with large-scale systems through noise-adaptive fusion, demonstrating both efficiency and robustness. Ablation studies highlight that the depth of fusion is the most critical factor, underscoring its importance in designing robust AVSR systems.
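
The bottleneck idea described above can be illustrated with a minimal NumPy sketch. This is a generic bottleneck-token fusion pattern, not the CoBRA architecture itself: the function names (`attend`, `bottleneck_fusion_layer`), the tensor sizes, the absence of learned projections, and the choice of three stacked layers are all assumptions made for illustration. In a trained model the tokens would be learnable parameters; here they are random.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification; real models use learned Q/K/V projections).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def bottleneck_fusion_layer(audio, video, tokens):
    # Step 1: the small set of bottleneck tokens gathers information
    # from both modalities.
    context = np.concatenate([audio, video], axis=0)
    tokens = tokens + attend(tokens, context)
    # Step 2: each stream reads back only from the tokens, so all
    # cross-modal exchange passes through the bottleneck.
    audio = audio + attend(audio, tokens)
    video = video + attend(video, tokens)
    return audio, video, tokens

# Hypothetical sizes: 50 audio frames, 25 video frames, 4 tokens, width 16.
rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 16))
video = rng.standard_normal((25, 16))
tokens = rng.standard_normal((4, 16))  # learnable in a real model

# Stacking layers is one way to vary the "depth of fusion".
for _ in range(3):
    audio, video, tokens = bottleneck_fusion_layer(audio, video, tokens)

print(audio.shape, video.shape, tokens.shape)
```

Because the audio stream never attends to the video frames directly, the number of tokens caps how much visual information can reach it per layer, which is the sense in which the tokens "regulate" information flow.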

Seaone Ok, Min Jun Choi, Eungbeom Kim, Seungu Han, Kyogu Lee • 2026

Related benchmarks

Task	Dataset	Result	Rank
Audio-Visual Speech Recognition	LRS3 (test)	WER 1.6	77
Audio-Visual Speech Recognition	LRS2 (test)	WER 2.8	34
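
The results above are word error rates (WER), presumably reported in percent. As a reminder of how the metric is defined, here is a minimal sketch that computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not part of the paper or of any benchmark toolkit.

```python
def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words,
    # computed with a standard edit-distance dynamic program over words.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}%")
```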
