LASER: Lip Landmark Assisted Speaker Detection for Robustness

About

Active Speaker Detection (ASD) aims to identify who is speaking in complex visual scenes. While humans naturally rely on lip-audio synchronization, existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized. To address this, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER), which explicitly incorporates lip landmarks during training to guide the model's attention to speech-relevant regions. Given a face track, LASER extracts visual features and encodes 2D lip landmarks into dense maps. To handle failure cases such as low resolution or occlusion, we introduce an auxiliary consistency loss that aligns lip-aware and face-only predictions, removing the need for landmark detectors at test time. LASER outperforms state-of-the-art models across both in-domain and out-of-domain benchmarks. To further evaluate robustness in realistic conditions, we introduce LASER-bench, a curated dataset of modern video clips with varying levels of background noise. On the high-noise subset, LASER improves mAP by 3.3 and 4.3 points over LoCoNet and TalkNet, respectively, demonstrating strong resilience to real-world acoustic challenges.
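The two training-time ideas in the abstract, encoding 2D lip landmarks into dense maps and aligning lip-aware with face-only predictions, can be illustrated with a minimal, dependency-free sketch. The abstract does not specify the encoding or the exact form of the consistency loss, so everything here is an assumption made for illustration: the Gaussian rendering, the Bernoulli KL divergence and its direction, the function names, and the toy coordinates are hypothetical and are not taken from the LASER implementation.

```python
import math


def landmarks_to_dense_map(landmarks, height, width, sigma=1.5):
    """Render 2D landmarks (x, y), in pixel coordinates, into one H x W map.

    Each pixel stores the largest Gaussian response over all landmarks.
    Gaussian rendering is an assumption; the source only says "dense maps".
    """
    dense = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for lx, ly in landmarks:
                dist_sq = (x - lx) ** 2 + (y - ly) ** 2
                response = math.exp(-dist_sq / (2 * sigma ** 2))
                dense[y][x] = max(dense[y][x], response)
    return dense


def bernoulli_kl(p, q, eps=1e-7):
    """KL divergence between two Bernoulli distributions with means p and q."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def consistency_loss(lip_aware_probs, face_only_probs):
    """Mean per-frame divergence between the two speaking-probability tracks.

    Using KL(lip-aware || face-only) is an assumed choice, not the paper's.
    """
    pairs = list(zip(lip_aware_probs, face_only_probs))
    return sum(bernoulli_kl(p, q) for p, q in pairs) / len(pairs)


# Hypothetical lip points on an 8 x 8 face crop.
lip_points = [(2, 5), (4, 6), (6, 5)]
dense_map = landmarks_to_dense_map(lip_points, height=8, width=8)

# Identical predictions give zero loss; diverging predictions are penalized.
loss_same = consistency_loss([0.9, 0.1], [0.9, 0.1])
loss_diff = consistency_loss([0.9, 0.1], [0.5, 0.5])
```

In a setup like this, only the face-only branch would be needed at inference, which matches the abstract's point that landmark detectors are not required at test time.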

Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee • 2025

Related benchmarks

Task	Dataset	Result
Active Speaker Detection	AVA-ActiveSpeaker (val)	mAP 95.4	123
Active Speaker Detection	Talkies (val)	mAP 89.7	14
Active Speaker Detection	ASW (val)	mAP 89.5	8
Active Speaker Detection	LASER-bench Low Noise	mAP 96.4	4
Active Speaker Detection	LASER-bench High Noise	mAP 90.0	4

