
Target Active Speaker Detection with Audio-visual Cues

About

In active speaker detection (ASD), the goal is to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the speaker's lip region. In real-world applications, reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in AP, AUC, and EER, respectively, on the ASW test set. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.
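The abstract reports results in AP and EER. As a rough illustration of what those two metrics measure, here is a minimal sketch in plain Python. The function names `average_precision` and `equal_error_rate` and the toy scores are assumptions made for this example; they are not taken from the TS-TalkNet repository, and the official AVA-ActiveSpeaker and ASW evaluation tools may differ in details such as interpolation and tie handling.

```python
def average_precision(labels, scores):
    """AP as the mean of the precision values at the rank of each positive.

    labels: 1 = speaking, 0 = not speaking (hypothetical per-frame labels).
    scores: higher means more confident that the person is speaking.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def equal_error_rate(labels, scores):
    """Approximate EER: sweep thresholds over the observed scores and take
    the point where false acceptance and false rejection rates are closest."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, best_eer = None, None
    for threshold in sorted(set(scores)):
        far = sum(s >= threshold for s in neg) / len(neg)  # false acceptance rate
        frr = sum(s < threshold for s in pos) / len(pos)   # false rejection rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy data, invented for illustration only.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(average_precision(toy_labels, toy_scores))
print(equal_error_rate(toy_labels, toy_scores))
```

A higher AP is better, while a lower EER is better, so the 0.8% EER improvement quoted above should be read as a reduction in error rate.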

Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Active Speaker Detection | AVA-ActiveSpeaker (val) | mAP 97.7 | 107 |
| Active Speaker Detection | AVA-ActiveSpeaker v1.0 (val) | mAP 93.9 | 27 |
| Active Speaker Detection | WASD (test) | mAP (OC) 96.8 | 9 |
| Active Speaker Detection | AVA-ActiveSpeaker Internal In-Domain (test) | mAP 92.7 | 7 |
| Active Speaker Detection | WASD External/Out-of-Domain (test) | mAP 85.7 | 7 |
| Active Speaker Detection | ASW (test) | mAP 98.5 | 5 |
