Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
About
We propose the Audio-Visual Lightweight ITerative model (AVLIT), an effective and lightweight neural network that uses Progressive Learning (PL) to perform audio-visual speech separation in noisy environments. To this end, we adopt the Asynchronous Fully Recurrent Convolutional Neural Network (A-FRCNN), which has shown successful results in audio-only speech separation. Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality. We evaluated our model in a controlled environment using the NTCD-TIMIT dataset and in the wild using a synthetic dataset that combines LRS3 and WHAM!. The experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines. Furthermore, the reduced footprint of our model makes it suitable for low-resource applications.
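The weight sharing mentioned in the abstract can be illustrated with a toy sketch. This is not the A-FRCNN block or the AVLIT implementation: the block here is a single dense layer with a tanh, and `make_block`, `apply_block`, `iterate_shared`, and `count_parameters` are hypothetical names introduced only for illustration. Re-injecting the input features at every iteration is also an assumption of this sketch. The point it shows is that unrolling one block for more iterations adds computation but no parameters.

```python
import math
import random


def make_block(dim, seed=0):
    """Create one toy block: a dense layer (dim x dim weights plus a bias)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return {"weights": weights, "bias": bias}


def apply_block(block, x):
    """Apply the toy block once: tanh(W x + b)."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(block["weights"], block["bias"])
    ]


def iterate_shared(block, features, num_iterations):
    """Run the SAME block num_iterations times (shared weights).

    Assumption of this sketch: the input features are added back to the
    running state before every iteration.
    """
    state = [0.0] * len(features)
    for _ in range(num_iterations):
        state = apply_block(block, [s + f for s, f in zip(state, features)])
    return state


def count_parameters(block):
    """Number of scalar parameters in the block, independent of iterations."""
    return sum(len(row) for row in block["weights"]) + len(block["bias"])
```

With a four-dimensional input, `iterate_shared(block, x, 1)` and `iterate_shared(block, x, 8)` both use the same 20 parameters reported by `count_parameters(block)`; only the number of passes through the block differs.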
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio-visual speech separation | LRS2-2Mix (test) | SI-SNRi | 12.8 | 33 |
| Audio-visual speech separation | LRS3 (test) | SDRi | 13.6 | 20 |
| Automatic Speech Recognition | LRS2-2Mix (test) | WER | 31.85 | 18 |
| Speech Separation | VoxCeleb2-2Mix (test) | SDRi | 9.9 | 12 |
| Speech Separation | LRS3-2Mix (test) | SDRi | 13.6 | 11 |
| Audio-visual speech separation | LRS2-3Mix (test) | SI-SNRi | 10.4 | 8 |
| Audio-Visual Speaker Separation | LRS3-2Mix (test) | SI-SNRi | 13.5 | 8 |
| Audio-visual speech separation | VoxCeleb2 (test) | SI-SNRi | 9.4 | 7 |
| Audio-Visual Speaker Separation | VoxCeleb2-2Mix (test) | SI-SNRi | 9.4 | 7 |
| Audio-visual speech separation | LRS2-4Mix (test) | SI-SNRi | 5 | 4 |
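For readers unfamiliar with the SI-SNRi figures in the table, the following is a minimal sketch of how scale-invariant SNR improvement is commonly computed, using the standard definition (project the zero-mean estimate onto the zero-mean reference, then compare target and residual energy in dB). It is not taken from the paper's evaluation code; the function names and the `eps` stabilizer are assumptions made for this example.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything in the estimate that is not the target counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

A higher SI-SNRi means the separated signal is closer to the clean reference than the input mixture was; because of the projection step, rescaling the estimate leaves the score essentially unchanged.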