Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

About

Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and incur high computational cost, which is unacceptable in many applications where speech separation serves only as a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, a more than 2.4x reduction in MACs, and over 6x faster GPU inference. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.

Kai Li, Kejun Gao, Xiaolin Hu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-visual speech separation | LRS3 (test) | SDRi | 18.9 | 29 |
| Audio-visual speech separation | LRS2 (test) | SDRi | 16.9 | 23 |
| Audio-visual speech separation | LRS2 | Parameters (M) | 6.22 | 18 |
| Audio-visual speech separation | VoxCeleb2 (test) | SI-SNRi | 14.6 | 16 |
| Multi-speaker separation | LRS2-2Mix (test) | SI-SNRi | 16.8 | 5 |
| Multi-speaker separation | LRS2-3Mix (test) | SI-SNRi | 13.1 | 5 |
| Multi-speaker separation | LRS2-4Mix (test) | SI-SNRi | 9.7 | 5 |
| Audio-visual speech separation | VoxCeleb2 Scenario 2: 1 speaker + music noise from FMA | SI-SNRi | 7.49 | 3 |
| Audio-visual speech separation | VoxCeleb2 Scenario 4: 1 speaker + music noise + 2 interfering speakers | SI-SNRi | 4.11 | 3 |
| Audio-visual speech separation | VoxCeleb2 Scenario 3: 1 speaker + environmental noise + 2 interfering speakers (test) | SI-SNRi | 5.37 | 3 |
Showing 10 of 12 rows
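
Most rows above report SI-SNRi, the improvement in scale-invariant signal-to-noise ratio (in dB) of the separated speech over the unprocessed mixture. The sketch below follows the commonly used SI-SNR definition and is not the authors' evaluation code; the function names `si_snr` and `si_snr_improvement` and the synthetic sine-wave signals are assumptions made for illustration only.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target to get the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    # Everything in the estimate that is not explained by the target is noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(x * x for x in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy example (synthetic data, not from the paper): a target tone plus an
# equally loud interfering tone, and an estimate that keeps 10% of the
# interferer's amplitude.
N = 1000
target = [math.sin(2 * math.pi * 5 * k / N) for k in range(N)]
interferer = [math.sin(2 * math.pi * 13 * k / N) for k in range(N)]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]

# The mixture scores about 0 dB and the estimate about 20 dB,
# so the improvement is about 20 dB.
print(round(si_snr_improvement(estimate, mixture, target), 2))
```

Because the estimate is projected onto the target before the energies are compared, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.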
