Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
About
Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and incur high computational cost, which is unacceptable in many applications where speech separation serves only as a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio-visual speech separation | LRS3 (test) | SDRi | 18.9 | 29 |
| Audio-visual speech separation | LRS2 (test) | SDRi | 16.9 | 23 |
| Audio-visual speech separation | LRS2 | Parameters (M) | 6.22 | 18 |
| Audio-visual speech separation | VoxCeleb2 (test) | SI-SNRi | 14.6 | 16 |
| Multi-speaker separation | LRS2-2Mix (test) | SI-SNRi | 16.8 | 5 |
| Multi-speaker separation | LRS2-3Mix (test) | SI-SNRi | 13.1 | 5 |
| Multi-speaker separation | LRS2-4Mix (test) | SI-SNRi | 9.7 | 5 |
| Audio-visual speech separation | VoxCeleb2 Scenario 2: 1 speaker + music noise from FMA | SI-SNRi | 7.49 | 3 |
| Audio-visual speech separation | VoxCeleb2 Scenario 4: 1 speaker + music noise + 2 interfering speakers | SI-SNRi | 4.11 | 3 |
| Audio-visual speech separation | VoxCeleb2 Scenario 3: 1 speaker + environmental noise + 2 interfering speakers (test) | SI-SNRi | 5.37 | 3 |
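
Most results in the table above are reported as SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. The following is a minimal NumPy sketch of the commonly used definition of this metric, not the evaluation code behind these numbers; the function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are assumptions made for illustration.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (illustrative helper)."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "signal" component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Whatever is left after the projection is treated as noise.
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Synthetic example (random noise stands in for speech; assumption for the demo).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = target + interferer            # unprocessed input
estimate = target + 0.1 * interferer     # hypothetical separator output
print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is preferred over plain SNR when a separator's output gain is arbitrary.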