Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

About

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves as only a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip-motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.

Kai Li, Kejun Gao, Xiaolin Hu • 2025

Related benchmarks

Task	Dataset	Result
Audio-visual speech separation	LRS3 (test)	SDRi 18.9	29
Audio-visual speech separation	LRS2 (test)	SDRi 16.9	23
Audio-visual speech separation	LRS2	Parameters (M) 6.22	18
Audio-visual speech separation	VoxCeleb2 (test)	SI-SNRi 14.6	16
Multi-speaker separation	LRS2-2Mix (test)	SI-SNRi 16.8	5
Multi-speaker separation	LRS2-3Mix (test)	SI-SNRi 13.1	5
Multi-speaker separation	LRS2-4Mix (test)	SI-SNRi 9.7	5
Audio-visual speech separation	VoxCeleb2 Scenario 2: 1 speaker + music noise from FMA	SI-SNRi 7.49	3
Audio-visual speech separation	VoxCeleb2 Scenario 4: 1 speaker + music noise + 2 interfering speakers	SI-SNRi 4.11	3
Audio-visual speech separation	VoxCeleb2 Scenario 3: 1 speaker + environmental noise + 2 interfering speakers (test)	SI-SNRi 5.37	3
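
Most results above are reported as SI-SNRi, the improvement in scale-invariant signal-to-noise ratio of the separated speech over the unprocessed mixture, in dB. The following is a minimal pure-Python sketch of the standard definition, not the authors' evaluation code; the function names (`si_snr`, `si_snr_improvement`) and the toy sine-wave signals are assumptions made for illustration.

```python
import math


def _zero_mean(signal):
    """Return a copy of the signal with its mean removed."""
    mean = sum(signal) / len(signal)
    return [v - mean for v in signal]


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR (dB) between two equal-length signals."""
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR gain of the estimate over the raw mixture (dB)."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Toy signals (hypothetical): a target tone plus an orthogonal interfering tone.
N = 800
reference = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 13 * n / N) for n in range(N)]
mixture = [r + i for r, i in zip(reference, interferer)]
# Pretend a separator attenuated the interferer amplitude by a factor of 10.
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement = si_snr_improvement(estimate, mixture, reference)
print(f"SI-SNRi: {improvement:.2f} dB")
```

In this toy setup the interferer has the same energy as the target, so the mixture sits near 0 dB and a tenfold amplitude reduction of the interferer yields an improvement of roughly 20 dB. Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged.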
