
LASER: Lip Landmark Assisted Speaker Detection for Robustness

About

Active Speaker Detection (ASD) aims to identify who is speaking in complex visual scenes. While humans naturally rely on lip-audio synchronization, existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized. To address this, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER), which explicitly incorporates lip landmarks during training to guide the model's attention to speech-relevant regions. Given a face track, LASER extracts visual features and encodes 2D lip landmarks into dense maps. To handle failure cases such as low resolution or occlusion, we introduce an auxiliary consistency loss that aligns lip-aware and face-only predictions, removing the need for landmark detectors at test time. LASER outperforms state-of-the-art models across both in-domain and out-of-domain benchmarks. To further evaluate robustness in realistic conditions, we introduce LASER-bench, a curated dataset of modern video clips with varying levels of background noise. On the high-noise subset, LASER improves mAP by 3.3 and 4.3 points over LoCoNet and TalkNet, respectively, demonstrating strong resilience to real-world acoustic challenges.
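The two training-time ideas in the abstract, encoding 2D lip landmarks into dense maps and aligning lip-aware with face-only predictions, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not specify the map encoding or the form of the consistency loss, so this sketch swaps in a Gaussian heatmap and a symmetric KL divergence between per-frame speaking probabilities. The function names `landmarks_to_heatmap`, `bernoulli_kl`, and `consistency_loss` are hypothetical and do not come from the LASER code.

```python
import math


def landmarks_to_heatmap(landmarks, height, width, sigma=1.5):
    """Rasterize (x, y) lip landmarks into one dense height x width map.

    Assumption: a Gaussian heatmap is used here for illustration; the
    paper's actual dense encoding may differ.
    Each pixel holds the maximum Gaussian response over all landmarks.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for lx, ly in landmarks:
        for y in range(height):
            for x in range(width):
                d2 = (x - lx) ** 2 + (y - ly) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                if value > heatmap[y][x]:
                    heatmap[y][x] = value
    return heatmap


def bernoulli_kl(p, q, eps=1e-7):
    """KL(Bernoulli(p) || Bernoulli(q)), with clamping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def consistency_loss(lip_aware_probs, face_only_probs):
    """Mean symmetric KL between two sets of per-frame speaking probabilities.

    Assumption: symmetric KL is one plausible way to align the two
    predictions; it is not confirmed as the loss used by the authors.
    """
    if len(lip_aware_probs) != len(face_only_probs):
        raise ValueError("both prediction lists must cover the same frames")
    if not lip_aware_probs:
        return 0.0
    total = 0.0
    for p, q in zip(lip_aware_probs, face_only_probs):
        total += 0.5 * (bernoulli_kl(p, q) + bernoulli_kl(q, p))
    return total / len(lip_aware_probs)
```

In a training loop, a term like `consistency_loss(lip_aware_probs, face_only_probs)` would be added to the main classification loss, so the face-only branch learns to agree with the lip-aware branch. This is what would allow the landmark detector to be dropped at test time, as the abstract describes.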

Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee • 2025

Related benchmarks

Task                      Dataset                  Result    Rank
Active Speaker Detection  AVA-ActiveSpeaker (val)  mAP 95.4  107
Active Speaker Detection  Talkies (val)            mAP 89.7  14
Active Speaker Detection  ASW (val)                mAP 89.5  8
Active Speaker Detection  LASER-bench Low Noise    mAP 96.4  4
Active Speaker Detection  LASER-bench High Noise   mAP 90    4
